ENH: pivot/groupby index with nan (GH3729) #21669

Closed
andrewdalecramer wants to merge 1 commit

Conversation

@andrewdalecramer andrewdalecramer commented Jun 29, 2018

Adds the ability to use NaN as a grouping variable. This required allowing NA as a factorisation value in factorize().
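For context, a minimal sketch of the default behaviour this PR sets out to change, assuming a pandas version where `pd.factorize` and `groupby` keep their default NaN handling. The frame and the column names `Key` and `Vals` are made up for illustration and are not taken from the PR's tests:

```python
import numpy as np
import pandas as pd

# Illustrative data only; 'Key' and 'Vals' are hypothetical column names.
df = pd.DataFrame({
    "Key": ["bar", "foo", np.nan, "bar", np.nan],
    "Vals": [1, 2, 3, 3, 1],
})

# By default factorize() maps missing values to the sentinel code -1
# and leaves them out of the uniques.
codes, uniques = pd.factorize(df["Key"])

# By default groupby() drops rows whose key is NaN,
# so the two NaN rows do not contribute to any group.
summed = df.groupby("Key")["Vals"].sum()
```

With NaN excluded from the uniques there is no group for it to land in, which is why the description ties the groupby change to a change in `factorize()`.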

codecov bot commented Jun 29, 2018

Codecov Report

Merging #21669 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #21669      +/-   ##
==========================================
+ Coverage    91.9%    91.9%   +<.01%     
==========================================
  Files         154      154              
  Lines       49656    49663       +7     
==========================================
+ Hits        45637    45644       +7     
  Misses       4019     4019

| Flag | Coverage Δ | |
|---|---|---|
| #multiple | 90.28% <100%> (ø) | ⬆️ |
| #single | 42.02% <20%> (-0.01%) | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| pandas/core/groupby/groupby.py | 92.66% <100%> (ø) | ⬆️ |
| pandas/core/algorithms.py | 94.89% <100%> (+0.03%) | ⬆️ |
| pandas/core/generic.py | 96.21% <100%> (ø) | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b63e81...7b3cf75.

@gfyoung added the Bug, Enhancement, Groupby, Missing-data, and Reshaping labels Jul 3, 2018

expected_na = Series([4, 2, 4], index=['bar', 'foo', np.nan])

assert_series_equal(agged_na, expected_na, check_dtype=False)

Member

Reference issue number in a comment.

name='Vals'
)
assert_series_equal(agged_na, expected_na, check_dtype=False)

Member

Reference issue number in a comment.

uniques = np.resize(uniques, len(uniques) + 1)
uniques[-1] = np.nan
labels = np.where(labels == na_sentinel, new_na_sentinel, labels)

Member

Given that you're using broadcasting and NumPy, I don't think we should have any performance issues, but I wonder if we should still add a benchmark test anyway.
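The relabeling step under review can be sketched in isolation as follows. This is a rough illustration, not the PR's implementation: `append_nan_group` is a hypothetical helper, and it builds an object-dtype array of uniques so that NaN can be stored next to non-float values.

```python
import numpy as np


def append_nan_group(labels, uniques, na_sentinel=-1):
    """Give NaN its own group code (hypothetical helper, not the PR's code)."""
    labels = np.asarray(labels)
    # Object dtype so that NaN can sit next to non-float uniques.
    uniques = np.asarray(uniques, dtype=object)
    new_uniques = np.append(uniques, np.nan)
    # NaN takes the last position, so its new code is the old length.
    new_code = len(uniques)
    # Vectorised remap: every sentinel entry becomes the new NaN code.
    new_labels = np.where(labels == na_sentinel, new_code, labels)
    return new_labels, new_uniques


labels = np.array([0, 1, -1, 0, -1])
new_labels, new_uniques = append_nan_group(labels, ["bar", "foo"])
```

The remap is a single `np.where` pass over the labels, which is the vectorised behaviour the comment above refers to; a benchmark would still be needed to confirm the cost on large inputs.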

Contributor

@jreback jreback left a comment

This needs quite a few more tests, and existing tests should be parameterized to accept NaN groups. I don't like the keyword at all. I haven't reviewed fully.

Author

andrewdalecramer commented Jul 4, 2018 via email

Contributor

jreback commented Jul 4, 2018

@andrewdalecramer

This needs to be dropna=None as the default, with a warning if it's not passed and an effective default of True for back-compat. Eventually we would want to make this False. I think it should only warn if there are actual NaN groups, to avoid gratuitous warnings all over the place.
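One way to read the suggestion above, sketched outside pandas with only the standard library. `resolve_dropna` is a hypothetical helper, and both the warning class and the message text are assumptions rather than anything agreed in this thread:

```python
import math
import warnings


def resolve_dropna(group_keys, dropna=None):
    """Return the effective dropna flag (hypothetical helper)."""
    if dropna is not None:
        # The caller chose explicitly, so there is nothing to warn about.
        return bool(dropna)
    has_nan = any(isinstance(k, float) and math.isnan(k) for k in group_keys)
    if has_nan:
        # Warn only when NaN keys are actually present.
        warnings.warn(
            "NaN group keys are dropped by default; pass dropna explicitly. "
            "The default may change to False in a future version.",
            FutureWarning,
            stacklevel=2,
        )
    # Back-compat default: keep dropping NaN groups.
    return True


# Explicit choice: no warning, the flag is returned as given.
effective = resolve_dropna(["bar", float("nan")], dropna=False)
```

Using `None` as the default lets the function tell "not passed" apart from an explicit `True`, so callers who already pass a value never see the warning.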

@ppallesen

@andrewdalecramer
Suggestion for keyword name: allow_na_groups = False

Contributor

jreback commented Oct 11, 2018

can you rebase this

Contributor

jreback commented Nov 23, 2018

this is a nice feature, but needs quite a bit of work to fully test this out.

closing as stale. if you'd like to continue, pls ping.

Development

Successfully merging this pull request may close these issues.

ENH: pivot/groupby index with nan
4 participants