Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Fix unexpected sort in groupby #17621

Merged
merged 4 commits into from
Oct 1, 2017

Conversation

Licht-T
Copy link
Contributor

@Licht-T Licht-T commented Sep 22, 2017

@Licht-T Licht-T changed the title Fix unexpected sort groupby BUG: Fix unexpected sort in groupby Sep 22, 2017
@codecov
Copy link

codecov bot commented Sep 22, 2017

Codecov Report

Merging #17621 into master will decrease coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17621      +/-   ##
==========================================
- Coverage    91.2%   91.16%   -0.04%     
==========================================
  Files         163      163              
  Lines       49637    49643       +6     
==========================================
- Hits        45269    45259      -10     
- Misses       4368     4384      +16
Flag Coverage Δ
#multiple 88.95% <100%> (-0.02%) ⬇️
#single 40.18% <0%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby.py 92.24% <100%> (+0.02%) ⬆️
pandas/core/generic.py 91.98% <100%> (ø) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/core/indexes/multi.py 96.39% <0%> (-0.51%) ⬇️
pandas/core/indexes/category.py 98.26% <0%> (-0.29%) ⬇️
pandas/core/frame.py 97.77% <0%> (-0.1%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8276a42...1e13713. Read the comment docs.

@codecov
Copy link

codecov bot commented Sep 22, 2017

Codecov Report

Merging #17621 into master will decrease coverage by 0.05%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #17621      +/-   ##
==========================================
- Coverage   91.27%   91.21%   -0.06%     
==========================================
  Files         163      163              
  Lines       49765    49770       +5     
==========================================
- Hits        45421    45399      -22     
- Misses       4344     4371      +27
Flag Coverage Δ
#multiple 89.01% <100%> (-0.04%) ⬇️
#single 40.32% <0%> (-0.07%) ⬇️
Impacted Files Coverage Δ
pandas/core/generic.py 92.07% <100%> (ø) ⬆️
pandas/core/groupby.py 92.25% <100%> (+0.01%) ⬆️
pandas/io/gbq.py 25% <0%> (-58.34%) ⬇️
pandas/plotting/_converter.py 63.38% <0%> (-1.82%) ⬇️
pandas/core/indexes/multi.py 96.39% <0%> (-0.51%) ⬇️
pandas/core/indexes/category.py 97.46% <0%> (-0.29%) ⬇️
pandas/core/frame.py 97.73% <0%> (-0.1%) ⬇️
pandas/core/indexes/base.py 96.34% <0%> (+0.05%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ad7d051...9b6a3da. Read the comment docs.

@@ -2613,6 +2613,13 @@ def _get_grouper(obj, key=None, axis=0, level=None, sort=True,

level = None
key = group_axis
elif key is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hate adding logic here, this function is already impenetrable, can you incorporate this to existing?

assert_frame_equal(result0, expected0)
assert_frame_equal(result1, expected1)

# axis=1

result0 = frame.T.groupby(level=0, axis=1).sum()
result1 = frame.T.groupby(level=1, axis=1).sum()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for a couple of these that you changed can you also add the sort=True case (maybe parametrize on sort=)?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a whatsnew note

@jreback jreback added Bug Groupby MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Sep 22, 2017
@pep8speaks
Copy link

pep8speaks commented Sep 22, 2017

Hello @Licht-T! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on September 29, 2017 at 11:42 Hours UTC

@Licht-T Licht-T force-pushed the fix-unexpected-sort-groupby branch 2 times, most recently from 576a6cc to 5567ac1 Compare September 22, 2017 15:24
@Licht-T
Copy link
Contributor Author

Licht-T commented Sep 22, 2017

@jreback Thanks for your review.

  • Changed the location of the solution to improve readability
  • Parameterized sort parameter of groupby in tests

@@ -2626,6 +2626,14 @@ def _get_grouper(obj, key=None, axis=0, level=None, sort=True,
elif isinstance(key, BaseGrouper):
return key, [], obj

if key is None and isinstance(group_axis, MultiIndex):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you simplify this. e.g. maybe put this coercion higher (before the giant if/then)

@@ -1791,18 +1791,19 @@ def aggfun(ser):
agged2 = df.groupby(keys).aggregate(aggfun)
assert len(agged2.columns) + 1 == len(df.columns)

def test_groupby_level(self):
@pytest.mark.parametrize('sort', [True, False])
def test_groupby_level(self, sort):
frame = self.mframe
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the issue number here as well as a comment

@Licht-T
Copy link
Contributor Author

Licht-T commented Sep 22, 2017

@jreback Thank you for comments. Fixed.

@@ -2586,6 +2586,15 @@ def _get_grouper(obj, key=None, axis=0, level=None, sort=True,
"""
group_axis = obj._get_axis(axis)

if key is None and level is not None and \
isinstance(group_axis, MultiIndex):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so maybe move this down a bit (under level is not None). I don't want this to be a bespoke condition. I think you can remove the isinstance check of MultiIndex.

level = level[0]

if is_scalar(level):
key = group_axis.get_level_values(level)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put a comment on what is going on here. maybe we can incorporate this below as well.

I am trying to remove as many special cases as possible.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Licht-T this looks pretty close. Could you fixup the merge conflict and take a look at @jreback's comments in the next couple days?

@@ -538,6 +538,7 @@ Groupby/Resample/Rolling
- Bug in ``Series.resample(...).apply()`` where an empty ``Series`` modified the source index and did not return the name of a ``Series`` (:issue:`14313`)
- Bug in ``.rolling(...).apply(...)`` with a ``DataFrame`` with a ``DatetimeIndex``, a ``window`` of a timedelta-convertible and ``min_periods >= 1` (:issue:`15305`)
- Bug in ``DataFrame.groupby`` where index and column keys were not recognized correctly when the number of keys equaled the number of elements on the groupby axis (:issue:`16859`)
- Bug in ``DataFrame.groupby`` where the single level selection from ``MultiIndex`` occurs unexpected index sorting (:issue:`17537`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"occurs" -> "incurs"? Or maybe "causes"?

@TomAugspurger TomAugspurger added this to the 0.21.0 milestone Sep 25, 2017
@Licht-T
Copy link
Contributor Author

Licht-T commented Sep 26, 2017

@TomAugspurger Okay. I'll do that.

@@ -538,6 +538,7 @@ Groupby/Resample/Rolling
- Bug in ``Series.resample(...).apply()`` where an empty ``Series`` modified the source index and did not return the name of a ``Series`` (:issue:`14313`)
- Bug in ``.rolling(...).apply(...)`` with a ``DataFrame`` with a ``DatetimeIndex``, a ``window`` of a timedelta-convertible and ``min_periods >= 1` (:issue:`15305`)
- Bug in ``DataFrame.groupby`` where index and column keys were not recognized correctly when the number of keys equaled the number of elements on the groupby axis (:issue:`16859`)
- Bug in ``DataFrame.groupby`` where the single level selection from ``MultiIndex`` occurs unexpected index sorting (:issue:`17537`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where a single level selection from a MultiIndex unexpectedly sorts.

# axis of the object
if level is not None:
if not isinstance(group_axis, MultiIndex):
# TODO: These two conditions are almost same.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok for now. can you come back in a future PR and see what we can do with all the conditions in this section. getting pretty unweildy (and document as much as possible).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Okay. I'll do that. These are too complicated to do refactoring in this PR, I think.

# TODO: These two conditions are almost same.
# We should combine two.
if isinstance(group_axis, MultiIndex):
if is_list_like(level) and len(level) == 1:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this condition I think you can pull out of the MultiIndex check here (as the else is the same condition)

Copy link
Contributor Author

@Licht-T Licht-T Sep 27, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback I am aware of this, but it seems that there are some processes only for non-MultiIndex in else. We have to consider carefully whether these are applicable for MultiIndex.
https://github.com/pandas-dev/pandas/pull/17621/files/e4cdd0726e685b0216056ba224ed363bf1e836f9#diff-720d374f1a709d0075a1f0a02445cd65R2618

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When these are applicable, we also have to check if there is no side effect to subsequent processes.

@jreback
Copy link
Contributor

jreback commented Sep 28, 2017

can you rebase

@Licht-T
Copy link
Contributor Author

Licht-T commented Sep 28, 2017

@jreback Rebased.

@jreback
Copy link
Contributor

jreback commented Sep 29, 2017

@Licht-T can you rebase and push once again, want to get all green here.

@Licht-T
Copy link
Contributor Author

Licht-T commented Sep 29, 2017

@jreback Now all green!

@jreback jreback merged commit fd336fb into pandas-dev:master Oct 1, 2017
@jreback
Copy link
Contributor

jreback commented Oct 1, 2017

thanks @Licht-T

ghost pushed a commit to reef-technologies/pandas that referenced this pull request Oct 2, 2017
* 'master' of github.com:pandas-dev/pandas: (188 commits)
  Separate out _convert_datetime_to_tsobject (pandas-dev#17715)
  DOC: remove whatsnew note for xref pandas-dev#17131
  BUG: Regression in .loc accepting a boolean Index as an indexer (pandas-dev#17738)
  DEPR: Deprecate cdate_range and merge into bdate_range (pandas-dev#17691)
  CLN: replace %s syntax with .format in pandas.core: categorical, common, config, config_init (pandas-dev#17735)
  Fixed the memory usage explanation of categorical in gotchas from O(nm) to O(n+m) (pandas-dev#17736)
  TST: add backward compat for offset testing for pickles (pandas-dev#17733)
  remove unused time conversion funcs (pandas-dev#17711)
  DEPR: Deprecate convert parameter in take (pandas-dev#17352)
  BUG:Time Grouper bug fix when applied for list groupers (pandas-dev#17587)
  BUG: Fix some PeriodIndex resampling issues (pandas-dev#16153)
  BUG: Fix unexpected sort in groupby (pandas-dev#17621)
  DOC: Fixed typo in documentation for 'pandas.DataFrame.replace' (pandas-dev#17731)
  BUG: Fix series rename called with str altering name rather index (GH17407) (pandas-dev#17654)
  DOC: Add examples for MultiIndex.get_locs + cleanups (pandas-dev#17675)
  Doc improvements for IntervalIndex and Interval (pandas-dev#17714)
  BUG: DataFrame sort_values and multiple "by" columns fails to order NaT correctly
  Last of the timezones funcs (pandas-dev#17669)
  Add missing file to _pyxfiles, delete commented-out (pandas-dev#17712)
  update imports of DateParseError, remove unused imports from tslib (pandas-dev#17713)
  ...
alanbato pushed a commit to alanbato/pandas that referenced this pull request Nov 10, 2017
No-Stream pushed a commit to No-Stream/pandas that referenced this pull request Nov 28, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Groupby MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: df.groupby(sort=False) sorts multi-index-frames
4 participants