DataFrameGroupby.boxplot fails when subplots=False #28102

charlesdong1991 · 2019-08-22T20:37:07Z

closes DataFrameGroupBy.boxplot with subplots=False fails when using column param #16748
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

charlesdong1991 · 2019-08-22T21:14:24Z

Is there a flaky test? 🤔 A lot of other PRs passed their tests, though.

doc/source/whatsnew/v1.0.0.rst

pandas/plotting/_matplotlib/boxplot.py

charlesdong1991 · 2019-08-23T18:34:07Z

I would really appreciate it if someone could give a hint as to why the failure happens in test_converter.py.

charlesdong1991 · 2019-08-26T17:00:42Z

I would really appreciate any hint on why test_converter could fail. @TomAugspurger I tried setting it to xfail to avoid the failures.

pandas/tests/plotting/test_converter.py

pandas/plotting/_matplotlib/boxplot.py

TomAugspurger · 2019-08-30T20:53:01Z

Just to make sure I understand, is there any API change here for an end-user? Is this code called indirectly via DataFrame.boxplot? I want to make sure there are no changes for that, when the user is providing a MultiIndexed dataframe and columns.

charlesdong1991 · 2019-08-31T20:17:37Z

You are absolutely right, this does change the behavior of DataFrame.boxplot for a MultiIndexed dataframe and columns. But I tested again, and I feel it might be a good change for end users. Right now, for a MultiIndexed dataframe, if you have a dataset like:

When you do `df.boxplot(column='two')`, an error is raised:

But with the new change, this can be plotted, and I feel this change is more useful and users might like it.

In the meantime, because of this, I found out that DataFrame.boxplot with a MultiIndex DataFrame is not tested at all in pandas; otherwise, this would have raised an error. So I also added a test for this case; the test is exactly the same as what I showed in the picture below.

Thanks again for your careful and thorough thought on the potential consequences, I am looking forward to further reviews. @TomAugspurger

TomAugspurger · 2019-09-03T16:00:34Z

> But with the new change, this can be plotted, and I feel this change is more useful and users might like it.

The alternative would be for the user to provide something like columns=pd.IndexSlice[:, 'two'], correct?

Just to make sure, there's no change to the case where the user-provided columns is from the outer level like bar?

My main concern is that by automatically doing this, we'd introduce some kind of ambiguity about what columns means.
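A minimal sketch of the explicit selection mentioned above. The frame is hypothetical (outer labels bar/baz, inner labels one/two), since the example frame from this thread is not shown, and it only demonstrates column selection with pd.IndexSlice through .loc, not what boxplot itself accepts:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two column levels, outer (bar, baz) and inner (one, two).
columns = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
df = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=columns)

# Explicit inner-level selection: every column whose second level is "two".
inner_two = df.loc[:, pd.IndexSlice[:, "two"]]

# Outer-level selection: plain indexing with an outer label.
outer_bar = df["bar"]

print(list(inner_two.columns))
print(list(outer_bar.columns))
```

With the explicit slice, the user states which level the label belongs to, which is the ambiguity concern raised here.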

charlesdong1991 · 2019-09-05T18:42:33Z

Sorry for my late response, I was out sick for a couple of days.
I read your comments again and think I fully get your point. Indeed, this will also change the current behavior for the outer level, like bar. Right now, df.boxplot(column='bar') gives:

You mean we should keep this behavior, right? @TomAugspurger

But I have to say the behaviour of df.boxplot(column='two') will change in this case. Before this PR, it raised a KeyError saying ['two'] not in index. Now it plots the columns containing two, because of this pd.IndexSlice.

I also had a second thought: we might add another argument to boxplot, something like include (the name could change), for MultiIndex columns. If True, users could specify more ambiguous columns to get them plotted, so column='bar' would plot all columns as long as 'bar' is in any level of the MultiIndex. For a normal Index, it would just be set to False inside the function no matter what, since this doesn't affect a normal Index at all.
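To make the proposal concrete, here is a rough sketch of the matching rule only. The helper name columns_containing and the frame are hypothetical; nothing like this exists in pandas, and the include argument was only an idea in this thread:

```python
import numpy as np
import pandas as pd


def columns_containing(df, label):
    """Hypothetical helper: column keys in which `label` appears at any level."""
    if not isinstance(df.columns, pd.MultiIndex):
        # A flat Index is unaffected: only exact matches count.
        return [col for col in df.columns if col == label]
    # Iterating a MultiIndex yields one tuple per column.
    return [col for col in df.columns if label in col]


columns = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
df = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=columns)

bar_cols = columns_containing(df, "bar")
two_cols = columns_containing(df, "two")
print(bar_cols)
print(two_cols)
```

The same label can match at different levels, which is where the ambiguity about what column means would come from.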

jreback · 2019-10-06T22:46:01Z

@charlesdong1991 can you merge master; can you summarize the issues here that were raised?

jreback · 2019-10-18T21:33:29Z

can you merge master

charlesdong1991 · 2020-02-02T08:12:29Z

Thanks a lot for picking it up @WillAyd. Yeah, I am active, and so is this PR! xD
I think there is a design issue with this PR, and right now the priority is the regressions and new issues posted for 1.0.

I will get back to your comments later next week. I also came across an issue that requires a feature for plotting based on the Index level, which might be relevant to this PR. I will try to get that feature done first, then come back to address the comments and push this through.

WillAyd · 2020-03-27T21:54:27Z

Can you move the whatsnew?

simonjayhawkins · 2020-05-08T16:43:11Z

@charlesdong1991 what's the status here?

charlesdong1991 · 2020-05-12T06:19:25Z

Yeah, I will give this PR another try later this week @simonjayhawkins

jreback · 2020-06-20T15:58:36Z

whats the status here?

jbrockmendel · 2020-08-30T01:18:09Z

@charlesdong1991 gentle ping can you rebase

github-actions · 2020-10-03T00:15:00Z

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

charlesdong1991 · 2020-10-03T11:04:38Z

hey @jreback @WillAyd @simonjayhawkins @TomAugspurger

Sorry for the very late update here. I just took a look at the original issue and my past commits together with the relevant codebase, and I feel the old approach was quite inefficient and confusing. So I reverted most of the earlier changes, and I think the new solution looks much better.

The issue was that, with a df like the one below, df.groupby('cat').boxplot(subplots=False, column='v') fails:

import numpy as np
import pandas as pd

df = pd.DataFrame({'cat': np.random.choice(list('abcde'), 100),
                   'v': np.random.rand(100),
                   'v1': np.random.rand(100)})
# This call failed before the fix in this PR
df.groupby('cat').boxplot(subplots=False, column='v')

This is because the data for plotting after df.groupby('cat') has MultiIndex (MI) columns, so v no longer exists as a column in the transformed data.

Therefore, the new solution is quite simple: I couple the groupby keys (in this case, [a, b, c, d, e]) with the value users assign to the column argument (in this case, v), so we have [(a, v), (b, v), (c, v), (d, v), (e, v)], and pass them to the boxplot function. The boxplot function then looks up the subset based on these new column values, instead of the v from the original df that users passed.
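The coupling described above can be sketched without any plotting. This is an approximation under assumptions, not the code in boxplot.py: it assumes the subplots=False path concatenates the groups side by side with the group keys as the outer column level, and it uses evenly repeated categories so the result is deterministic:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cat": list("abcde") * 20,  # every group is guaranteed to be present
    "v": rng.random(100),
    "v1": rng.random(100),
})

# Assumed shape of the plotting data: groups concatenated along the columns,
# which produces MultiIndex columns of (group key, original column).
keys, frames = zip(*df.groupby("cat"))
wide = pd.concat(frames, keys=keys, axis=1)

# The plain label is no longer a valid key for the MultiIndex columns.
print("v" in wide.columns)

# Couple each group key with the user-supplied column value.
column = ["v"]
pairs = list(pd.MultiIndex.from_product([keys, column]))
print(pairs)

# The coupled tuples select exactly one column per group.
subset = wide[pairs]
print(subset.shape)
```

Each tuple addresses one group's copy of v, so the user-facing column argument can stay a plain label.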

How does it sound now? Feedback and reviews are very welcome!

~~Regarding the error, it seems irrelevant to this PR.~~
EDIT:
The CI failure is gone now after committing one more time.

jreback · 2020-10-10T23:12:11Z

@charlesdong1991 this looks good, can you post an updated plot (and your tests look nice); merge master and ping on green-ish..

charlesdong1991 · 2020-10-13T18:10:05Z

Sure, these are the plots with and without specifying column. I also updated the PR description at the top.

jreback · 2020-11-04T03:01:18Z

@charlesdong1991 can you merge master

charlesdong1991 · 2020-11-07T06:50:54Z

small ping here @jreback sorry for the late and long response

jreback · 2020-11-08T03:03:16Z

thanks @charlesdong1991 very nice!


charlesdong1991 added 5 commits December 3, 2018 17:43

remove \n from docstring

7e461a1

fix conflicts

1314059

Merge remote-tracking branch 'upstream/master'

8bcb313

Merge remote-tracking branch 'upstream/master' into fix_issue_16748

a30fd5c

Fix issue 16748

1d0ac65

gfyoung added Groupby Visualization plotting labels Aug 23, 2019

gfyoung reviewed Aug 23, 2019

doc/source/whatsnew/v1.0.0.rst

gfyoung reviewed Aug 23, 2019

pandas/plotting/_matplotlib/boxplot.py

charlesdong1991 added 3 commits August 23, 2019 08:41

Code change based on review

af41084

Fix import sort linting

193eb2c

Merge remote-tracking branch 'upstream/master' into fix_issue_16748

db214b6

Skip the failing test

dfc72b2

Merge remote-tracking branch 'upstream/master' into fix_issue_16748

6cd2d28

TomAugspurger reviewed Aug 30, 2019

pandas/tests/plotting/test_converter.py

Remove skip

5c69d10

TomAugspurger reviewed Aug 30, 2019

pandas/plotting/_matplotlib/boxplot.py

remove imports

c08c278

More careful change

1df91da

charlesdong1991 added 2 commits September 5, 2019 21:11

fix conflict

24c5d93

keep the change

9cce7f7

charlesdong1991 marked this pull request as draft September 1, 2020 20:07

github-actions bot added the Stale label Oct 3, 2020

charlesdong1991 added 4 commits October 3, 2020 11:54

much better solution

c786a55

format

a1d84b9

typo

4eecae8

whatsnew

b6d1b4c

charlesdong1991 removed the Stale label Oct 3, 2020

charlesdong1991 marked this pull request as ready for review October 3, 2020 10:24

commit one more

8ab2db3

charlesdong1991 requested a review from WillAyd October 7, 2020 11:55

jreback added this to the 1.2 milestone Oct 10, 2020

charlesdong1991 added 3 commits October 13, 2020 20:10

Merge remote-tracking branch 'upstream/master' into fix_issue_16748

9f5caaa

Merge remote-tracking branch 'upstream/master' into fix_issue_16748

6bf3914

Merge remote-tracking branch 'upstream/master' into fix_issue_16748

a2884e7

Merge remote-tracking branch 'upstream/master' into fix_issue_16748

067323d

jreback merged commit 5fd478d into pandas-dev:master Nov 8, 2020
