-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP: ENH: pivot/groupby index with nan #12607
Conversation
@@ -166,6 +167,8 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None): | |||
na_sentinel : int, default -1 | |||
Value to mark "not found" | |||
size_hint : hint to the hashtable sizer | |||
dropna : boolean, default True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add version added directive
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done!
Tests I still need to add:
|
@nbonnotte looking good. I think this needs quite a few more tests. This is actually a major change and needs some thorough testing. Essentially need to replicate lots of tests, just now with a NA group. |
can you rebase / update |
Codecov Report
@@ Coverage Diff @@
## master #12607 +/- ##
==========================================
+ Coverage 90.99% 91.04% +0.04%
==========================================
Files 153 137 -16
Lines 50469 49312 -1157
==========================================
- Hits 45926 44894 -1032
+ Misses 4543 4418 -125
Continue to review full report at Codecov.
|
@jreback Well, this is a lot more tricky than I thought. I'll need to touch to some cython function For instance, in the
we obtain The thing is, I'm now sailing into what is for me uncharted territory. Do you remember why it is necessary to check for NaN values? EDIT: this is in fact a completely separate issue (#8427), which has nothing to do with the matter. I tumbled on this while adapting the existing tests. |
can you rebase and i'll have a look |
@jreback Done it |
I've written a todo list with all the tests I think I might need to add / copy / adapt. I've put it in the first message, I'll be easier to find there I think. If I'm missing something, please let me know. |
The tests fail on Windows because of the check on dtype:
In the test, I had forced I'm going to use |
the changes for windows dtype comparison might signify and underlying problem |
@jreback I've removed the |
so need to expand the testing to include object (e.g. strings), and datetimelike (datetimes, timedelta, period) with nans. I think easiest is to create a new class in the Tester, e.g. TestGroupbyDropna and have the appropriate tests there (and name them the same as in TestGroupby). Alternative, you can move sections of tests to a separate testing class, e.g. TestGroupbyNth TestGroupbyAgg TestGroupbyTransform or even better is to reorge like other things tests/groupby/nth.py, agg.py, transform.py, basic.py etc. So. could move the tests first (to a sub-dir), or put them in separate classes for now |
can you rebase and we'll see where this is |
I've rebased. Why does Appveyor say I've cancelled the tests? |
thanks @nbonnotte appveyor is giving some troubles, they just fixed it, so requeing some things. |
pandas/core/algorithms.py
Outdated
dropna : boolean, default True | ||
Drop NaN values | ||
|
||
.. versionadded:: 0.19.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.20.0
pandas/core/algorithms.py
Outdated
|
||
# the following line fills uniques: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove the comment (I generally like comments, but a more general comment or nothing is warranted here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a comment because it took me a very long time and some digging in the pyx files to understand what was going on. I am not expert enough to write a more general comment, but I would have liked to signal that to someone with the same knowledge of pandas as me, to avoid them the same difficulties I had
What should a more general comment look like?
@@ -4046,6 +4046,10 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True, | |||
squeeze : boolean, default False | |||
reduce the dimensionality of the return type if possible, | |||
otherwise return a consistent type | |||
dropna : boolean, default True | |||
drop NaN in the grouping values | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.20.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
pandas/core/groupby.py
Outdated
dropna : boolean, default True | ||
drop NaN in the grouping values | ||
|
||
.. versionadded:: 0.19.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
pandas/core/groupby.py
Outdated
@@ -2190,6 +2196,7 @@ class Grouping(object): | |||
level : | |||
in_axis : if the Grouping is a column in self.obj and hence among | |||
Groupby.exclusions list | |||
dropna | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment
pandas/tests/groupby/test_groupby.py
Outdated
@@ -120,6 +126,52 @@ def checkit(dtype): | |||
for dtype in ['int64', 'int32', 'float64', 'float32']: | |||
checkit(dtype) | |||
|
|||
def test_basic_with_nan(self): | |||
|
|||
# GH 3729 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment here
pandas/tests/groupby/test_groupby.py
Outdated
# GH 3729 | ||
|
||
def checkit(dtype): | ||
data = Series(np.arange(9) // 3, index=np.arange(9), dtype=dtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don't specify an index unless it is used explicity (e.g. here its just a range which is the default)
pandas/tests/groupby/test_groupby.py
Outdated
# corner cases | ||
self.assertRaises(Exception, grouped.aggregate, lambda x: x * 2) | ||
|
||
for dtype in ['int64', 'int32', 'float64', 'float32']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add object, datetimes, timedelta & datetime64 with tz
pandas/tests/groupby/test_groupby.py
Outdated
def checkit(dtype): | ||
data = Series(np.arange(9) // 3, index=np.arange(9), dtype=dtype) | ||
|
||
index = np.arange(9) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move all tests to a new class, maybe TestNanGrouping (you can even put it in a new file).
We need quite comprehensive testing and it will require several tests (over multiple dtypes)
pandas/tests/groupby/test_groupby.py
Outdated
@@ -1304,33 +1304,6 @@ def test_transform_coercion(self): | |||
result = g.transform(lambda x: np.mean(x)) | |||
assert_frame_equal(result, expected) | |||
|
|||
def test_with_na(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to the new file test_groupby_nan.py
|
||
from numpy import nan | ||
|
||
from pandas.core.index import Index, MultiIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just from pandas import Index, Series ......
|
||
result = df.groupby(df.b, dropna=False)['a'].transform(max) | ||
expected = pd.Series([1, 1, 2, 3, 4, 6, 6, 9, 9, 9], | ||
name='a', dtype=np.integer) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback is this (ensuring dtype=np.integer
) an admissible way to deal with windows's int32/int64 issues? Without that, the test fail because result is int32 and expected is int64... which is maybe something I should fix, instead of tampering with the tests, but I have no idea how to do that. In this pull/request, I never actually touch the data...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nbonnotte the issue here is you should simply construct the test frames in a way that they are not platform dependent, IOW, use lists, or np.array with a specific dtype. range
and unspecified dtypes will yield int32
on windows (while int64
) on other platforms.
IOW
do
df = DataFrame({'A': np.arange(10, dtype='int64')})
from pandas.core.index import Index, MultiIndex | ||
from pandas.core.api import DataFrame | ||
from pandas.core.series import Series | ||
from pandas.util.testing import assert_frame_equal, assert_series_equal |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
from pandas.util import testing as tm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you do that below, so don't import these, just use them directly tm.assert_frame_equal
from .common import MixIn | ||
|
||
|
||
class TestGroupBy(MixIn, tm.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since we are writing new tests, pls use pytest
features, include parameterization. (IOW don't use a class, now use self.assert*
, instead use assert
(and of course tm.assert_series_equal
and such)
|
||
def setUp(self): | ||
MixIn.setUp(self) | ||
self.df_nan = DataFrame( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. this becomes a fixture
# corner cases | ||
self.assertRaises(Exception, grouped.aggregate, lambda x: x * 2) | ||
|
||
for dtype in ['int64', 'int32', 'float64', 'float32']: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you parametrize this and simply write a function
name='a') | ||
assert_series_equal(result, expected) | ||
|
||
def test__cython_agg_general_nan(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1 _
There is a problem when there is I guess the correct solution would be to convert all |
@nbonnotte we already do inference for things like this on |
I have a problem with my MultiIndexes:
So
@jreback Should I alter the test to make the multiindexes comparable? If yes, how? Or is there a problem with my code that I should fix instead? |
@nbonnotte having nans in an MultiIndex levels is an anti-pattern. How did you construct this? |
@nbonnotte how's this coming |
working on issue #3729
I'm not doing much, really. I think it's just that the groupings' labels aren't supposed to be NaNs. I need to fix this, obviously |
Finally got rid of ``geom._make_pinfos`. Had to create a wrapper `groupby_with_null` around `DataFrame.groupby` to allow grouping on columns with Null values. Almost at the same time a PR [1] popped up to probably solve this issue. --- [1] pandas-dev/pandas#12607
good ideas, but needs rebase / update. pls comment if you wish to proceed. |
Frankly, it's the number of tests to add or update that killed me. |
@nbonnotte I'd like to take this up if you are not working on this. Can you confirm? |
@TrigonaMinima No, I'm no longer working on it :) |
I see this as a major drawback of the powerful groupby functions in pandas and the first time I came upon it, without any error messages, the return of an empty DataFrame really threw me off. I hope the previous work can be finished. |
I'm working on a solution for issue #3729.
I've identified where to work and put in place a first, crude solution. I need to do more tests.
git diff upstream/master | flake8 --diff
xref #9941, #5456, #6992, #443
2016-08-28: here are the tests I think need to be added / copied /adapted:
Basic tests:
Tests with data types
Existing tests to copy or adapt from TestGroupBy:
And also I will need to edit the documentation, for instance this part: