
WIP: ENH: pivot/groupby index with nan #12607

Closed · wants to merge 2 commits

Conversation

nbonnotte (Contributor) commented on Mar 13, 2016

I'm working on a solution for issue #3729.

I've identified where the changes need to go and put a first, crude solution in place. I still need to add more tests.

xref #9941, #5456, #6992, #443
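
For context, here is a minimal illustration of the behaviour this PR targets: rows whose group key is NaN are currently dropped without warning. The dropna keyword used in the second call is the option proposed in this PR (released pandas versions only picked this feature up later), so treat it as a sketch of the intended API:

import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", np.nan, "b", "a"], "val": [1, 2, 3, 4]})

df.groupby("key")["val"].sum()
# key
# a    5
# b    3
# Name: val, dtype: int64   <- the row with the NaN key is silently dropped

# proposed: keep NaN as its own group
df.groupby("key", dropna=False)["val"].sum()
# key
# a      5
# b      3
# NaN    2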


2016-08-28: here are the tests I think need to be added / copied / adapted (a rough sketch of one such test follows the lists below):

Basic tests:

  • Create a new test class, in a new file
  • test_basic_with_nan with dtypes in object, datetimes, timedelta & datetime64 with tz
  • test_groupby_series_dropna
  • test_groupby_frame_dropna
  • test_groupby_panel_dropna
  • test_groupby_multi_dropna

Tests with data types:

  • integer, floats
  • datetime, timedelta
  • strings and other objects
  • None

Existing tests to copy or adapt from TestGroupBy:

  • test_first_last_nth
  • test_nth
  • test_nth_multi_index_as_expected
  • test_nth_multi_index
  • test_group_selection_cache
  • test_basic
  • test_with_na
  • test_series_groupby_nunique
  • test_groupby_level_with_nas
  • test_groupby_nat_exclude
  • test_groupby_groups_datetimeindex
  • test_nlargest
  • test_nsmallest
  • test_groupby_as_index_agg
  • test_agg_api
  • test_transform
  • test_builtins_apply
  • test_groupby_whitelist
  • test__cython_agg_general
  • test_ops_general
  • test_groupby_transform_with_nan_group
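
As mentioned above, here is a rough sketch of what one of these tests (test_groupby_series_dropna) could look like, assuming the dropna keyword proposed in this PR; the data and expected values are illustrative only:

import numpy as np
import pandas as pd
import pandas.util.testing as tm  # pandas.testing in newer releases


def test_groupby_series_dropna():
    # GH 3729: a NaN grouping key should form its own group when dropna=False
    s = pd.Series([1.0, 2.0, 3.0, 4.0])
    keys = pd.Series(["a", np.nan, "b", "a"])

    result = s.groupby(keys, dropna=False).sum()
    expected = pd.Series([5.0, 3.0, 2.0], index=pd.Index(["a", "b", np.nan]))

    tm.assert_series_equal(result, expected)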

I will also need to edit the documentation, for instance this part:

@@ -166,6 +167,8 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
na_sentinel : int, default -1
Value to mark "not found"
size_hint : hint to the hashtable sizer
dropna : boolean, default True
Review comment (Contributor):

add version added directive

nbonnotte (author):

done!
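
For reference, a minimal sketch of what factorize does with missing values today: they are marked with the na_sentinel rather than appearing among the uniques. The dropna flag documented in the hunk above is meant to control this behaviour:

import numpy as np
import pandas as pd

codes, uniques = pd.factorize(np.array(["a", np.nan, "b", "a"], dtype=object))
codes    # array([ 0, -1,  1,  0]) -- NaN is mapped to the na_sentinel (-1)
uniques  # array(['a', 'b'], dtype=object) -- NaN never becomes a unique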

jreback added the Groupby, Missing-data, and Reshaping labels on Mar 13, 2016
nbonnotte (author):

Tests I still need to add:

  • datetimes with NaT
  • panel

jreback (Contributor) commented on Mar 22, 2016:

@nbonnotte looking good. I think this needs quite a few more tests. This is actually a major change and needs some thorough testing. Essentially need to replicate lots of tests, just now with a NA group.

jreback commented on May 7, 2016:

can you rebase / update

codecov-io commented on Jun 1, 2016:

Codecov Report

Merging #12607 into master will increase coverage by 0.04%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #12607      +/-   ##
==========================================
+ Coverage   90.99%   91.04%   +0.04%     
==========================================
  Files         153      137      -16     
  Lines       50469    49312    -1157     
==========================================
- Hits        45926    44894    -1032     
+ Misses       4543     4418     -125
Flag       Coverage Δ
#multiple  ?
#single    ?

Impacted Files              Coverage Δ
pandas/core/generic.py 96.25% <ø> (-0.04%) ⬇️
pandas/core/groupby.py 95% <100%> (+2.97%) ⬆️
pandas/core/algorithms.py 94.48% <100%> (+0.02%) ⬆️
pandas/tools/plotting.py 71.79% <0%> (-28.21%) ⬇️
pandas/io/pickle.py 74.28% <0%> (-5.26%) ⬇️
pandas/conftest.py 91.66% <0%> (-3.79%) ⬇️
pandas/util/_tester.py 35.29% <0%> (-3.6%) ⬇️
pandas/util/decorators.py 62.26% <0%> (-3.05%) ⬇️
pandas/tools/merge.py 91.78% <0%> (-1.91%) ⬇️
pandas/__init__.py 91.66% <0%> (-1.52%) ⬇️
... and 152 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1a117fc...f7a4c60.

jreback added this to the 0.19.0 milestone on Jun 4, 2016
nbonnotte (author) commented on Jun 19, 2016:

@jreback Well, this is a lot trickier than I thought. I'll need to touch some Cython functions.

For instance, in the last() aggregation method, some NaN values are silently skipped. If the last two rows of a group are

[[1,   2]
 [NaN, 3]]

we obtain [1, 3], because the NaN value is not considered.
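
A minimal reproduction of the behaviour described above (this is existing GroupBy.last behaviour, independent of this PR):

import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["g", "g"], "x": [1.0, np.nan], "y": [2, 3]})

df.groupby("key").last()
#        x  y
# key
# g    1.0  3   <- the NaN in column x is skipped, so the "last" row comes out as [1, 3]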

The thing is, I'm now sailing into what is for me uncharted territory.

Do you remember why it is necessary to check for NaN values?

EDIT: this is in fact a completely separate issue (#8427), which has nothing to do with the matter at hand. I stumbled on it while adapting the existing tests.

jorisvandenbossche modified the milestones: 0.19.0, 0.20.0 on Jul 8, 2016
jreback commented on Jul 15, 2016:

can you rebase and i'll have a look

jreback modified the milestones: 0.20.0, 0.19.0 on Jul 20, 2016
nbonnotte (author):

@jreback Done it

nbonnotte (author):

I've written a todo list with all the tests I think I might need to add / copy / adapt. I've put it in the first message; it'll be easier to find there, I think.

If I'm missing something, please let me know.

nbonnotte (author):

The tests fail on Windows because of the check on dtype:

FAIL: test_groupby_series_dropna (pandas.tests.test_groupby.TestGroupBy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\tests\test_groupby.py", line 794, in test_groupby_series_dropna
    assert_series_equal(result, expected)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1154, in assert_series_equal
    assert_attr_equal('dtype', left, right)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 878, in assert_attr_equal
    left_attr, right_attr)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1018, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: Attributes are different
Attribute "dtype" are different
[left]:  int64
[right]: int32

In the test, I had forced dtype=int, which apparently means int32 on 64-bit Windows.

I'm going to use assert_stuff_equal with check_dtype=False.

jreback commented on Oct 2, 2016:

the changes for the Windows dtype comparison might signify an underlying problem
don't change the comparisons unless the expected is being constructed with an ndarray
and pls note these tests

nbonnotte (author):

@jreback I've removed the check_dtype=False, and also the dtype=int I had put in (I don't even remember why), which created the problem in the first place.

jreback commented on Dec 6, 2016:

so need to expand the testing to include object (e.g. strings), and datetimelike (datetimes, timedelta, period) with nans.

I think the easiest is to create a new class in the test module, e.g. TestGroupbyDropna, and have the appropriate tests there (named the same as in TestGroupby).

Alternatively, you can move sections of tests to separate testing classes, e.g.

TestGroupbyNth

TestGroupbyAgg

TestGroupbyTransform

or, even better, reorganize like other test modules:

tests/groupby/nth.py, agg.py, transform.py, basic.py

etc.

So: you could move the tests first (to a sub-directory), or put them in separate classes for now.
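
A rough sketch of how the dtype coverage could look in such a dedicated module (everything here is illustrative; the dropna keyword is the one proposed in this PR, and the exact file/class layout is whatever gets chosen above):

import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize(
    "keys",
    [
        ["a", np.nan, "b", "a"],                                           # object / strings
        pd.to_datetime(["2016-01-01", None, "2016-01-02", "2016-01-01"]),  # datetime64 with NaT
        pd.to_timedelta(["1 day", None, "2 days", "1 day"]),               # timedelta64 with NaT
        pd.PeriodIndex(["2016-01", None, "2016-02", "2016-01"], freq="M"), # period with NaT
    ],
)
def test_groupby_dropna_false_keeps_missing_group(keys):
    df = pd.DataFrame({"key": keys, "val": [1, 2, 3, 4]})
    result = df.groupby("key", dropna=False)["val"].sum()

    # the missing key (NaN/NaT) must show up as its own group
    assert result.index.isna().any()
    assert result[result.index.isna()].iloc[0] == 2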

jreback commented on Dec 30, 2016:

can you rebase and we'll see where this is

nbonnotte (author):

I've rebased.

Why does Appveyor say I've cancelled the tests?

jreback commented on Jan 17, 2017:

thanks @nbonnotte

appveyor is giving some trouble; they just fixed it, so I'm requeueing some things.

dropna : boolean, default True
Drop NaN values

.. versionadded:: 0.19.0
Review comment (Contributor):

0.20.0


# the following line fills uniques:
Review comment (Contributor):

remove the comment (I generally like comments, but a more general comment or nothing is warranted here)

nbonnotte (author):

I added a comment because it took me a very long time and some digging in the pyx files to understand what was going on. I am not expert enough to write a more general comment, but I would have liked to flag this for someone with the same level of pandas knowledge as me, to spare them the difficulties I had.

What should a more general comment look like?

@@ -4046,6 +4046,10 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
squeeze : boolean, default False
reduce the dimensionality of the return type if possible,
otherwise return a consistent type
dropna : boolean, default True
drop NaN in the grouping values

Review comment (Contributor):

0.20.0

nbonnotte (author):

done

dropna : boolean, default True
drop NaN in the grouping values

.. versionadded:: 0.19.0
Review comment (Contributor):

same

@@ -2190,6 +2196,7 @@ class Grouping(object):
level :
in_axis : if the Grouping is a column in self.obj and hence among
Groupby.exclusions list
dropna

Review comment (Contributor):

add a comment

@@ -120,6 +126,52 @@ def checkit(dtype):
for dtype in ['int64', 'int32', 'float64', 'float32']:
checkit(dtype)

def test_basic_with_nan(self):

# GH 3729
Review comment (Contributor):

comment here

# GH 3729

def checkit(dtype):
data = Series(np.arange(9) // 3, index=np.arange(9), dtype=dtype)
Review comment (Contributor):

don't specify an index unless it is used explicitly (e.g. here it's just a range, which is the default)

# corner cases
self.assertRaises(Exception, grouped.aggregate, lambda x: x * 2)

for dtype in ['int64', 'int32', 'float64', 'float32']:
Review comment (Contributor):

add object, datetimes, timedelta & datetime64 with tz

def checkit(dtype):
data = Series(np.arange(9) // 3, index=np.arange(9), dtype=dtype)

index = np.arange(9)
Review comment (Contributor):

move all tests to a new class, maybe TestNanGrouping (you can even put it in a new file).

We need quite comprehensive testing and it will require several tests (over multiple dtypes)

@@ -1304,33 +1304,6 @@ def test_transform_coercion(self):
result = g.transform(lambda x: np.mean(x))
assert_frame_equal(result, expected)

def test_with_na(self):
nbonnotte (author):

Moved to the new file test_groupby_nan.py


from numpy import nan

from pandas.core.index import Index, MultiIndex
Review comment (Contributor):

just from pandas import Index, Series ......


result = df.groupby(df.b, dropna=False)['a'].transform(max)
expected = pd.Series([1, 1, 2, 3, 4, 6, 6, 9, 9, 9],
name='a', dtype=np.integer)
nbonnotte (author):

@jreback is this (forcing dtype=np.integer) an admissible way to deal with Windows' int32/int64 issues? Without it, the test fails because result is int32 and expected is int64... which is maybe something I should fix instead of tampering with the tests, but I have no idea how to do that. In this pull request, I never actually touch the data...

Review comment (Contributor):

@nbonnotte the issue here is that you should simply construct the test frames in a way that is not platform dependent, IOW use lists, or np.array with a specific dtype. range and unspecified dtypes will yield int32 on Windows (and int64 on other platforms).

IOW

do

df = DataFrame({'A': np.arange(10, dtype='int64')})

from pandas.core.index import Index, MultiIndex
from pandas.core.api import DataFrame
from pandas.core.series import Series
from pandas.util.testing import assert_frame_equal, assert_series_equal
Review comment (Contributor):

from pandas.util import testing as tm

Review comment (Contributor):

you do that below, so don't import these; just use them directly, e.g. tm.assert_frame_equal

from .common import MixIn


class TestGroupBy(MixIn, tm.TestCase):
Review comment (Contributor):

since we are writing new tests, pls use pytest features, including parametrization (IOW don't use a class nor self.assert*; instead use plain assert, and of course tm.assert_series_equal and such)


def setUp(self):
MixIn.setUp(self)
self.df_nan = DataFrame(
Review comment (Contributor):

e.g. this becomes a fixture
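
For illustration, roughly how the setUp data could become a pytest fixture (df_nan and its contents here are placeholders, not the PR's actual frame; the dropna keyword is the one this PR proposes):

import numpy as np
import pandas as pd
import pytest


@pytest.fixture
def df_nan():
    # placeholder frame with a NaN in the grouping column
    return pd.DataFrame({"key": ["a", np.nan, "b", "a"],
                         "val": [1.0, 2.0, 3.0, 4.0]})


def test_nan_group_is_kept(df_nan):
    result = df_nan.groupby("key", dropna=False)["val"].count()
    assert result[np.nan] == 1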

# corner cases
self.assertRaises(Exception, grouped.aggregate, lambda x: x * 2)

for dtype in ['int64', 'int32', 'float64', 'float32']:
Review comment (Contributor):

you parametrize this and simply write a function

name='a')
assert_series_equal(result, expected)

def test__cython_agg_general_nan(self):
Review comment (Contributor):

1 _ (i.e. use a single underscore in the test name)

nbonnotte (author) commented on Mar 5, 2017:

There is a problem when there is None in the grouping values. As it is, None is treated differently from NaN, and then there is an issue during sorting.

I guess the correct solution would be to convert all None to NaN... Let's see how I can do that

jreback commented on Mar 5, 2017:

@nbonnotte we already do inference for things like this on Grouping, for datetimelike (e.g. things are set as NaT). So you will need to replace None with np.nan in object series. But this should be a very minimal type of check.
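
A minimal sketch of the kind of normalization meant here (illustrative only; the real change would live in the Grouping code, not in a test or user code):

import numpy as np
import pandas as pd

# object-dtype grouping values may mix None and np.nan
values = np.array(["a", None, "b", np.nan], dtype=object)

# normalize: treat every missing marker as np.nan so grouping and sorting see a single value
values[pd.isna(values)] = np.nan

values  # array(['a', nan, 'b', nan], dtype=object)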

nbonnotte (author):

I have a problem with my MultiIndexes:

(Pdb) left.index
MultiIndex(levels=[[u'e', nan, u'c', u'y', u's', u'w', u'z', u'u', u'n', u't', u'b', u'm', u'o', u'g', u'p', u'j', u'q', u'a', u'v', u'x', u'k', u'f', u'r', u'h', u'i', u'l'], [2015-08-24 00:00:00, 2015-08-30 00:00:00, 2015-09-01 00:00:00, NaT, 2015-08-31 00:00:00, 2015-08-27 00:00:00, 2015-08-29 00:00:00, 2015-08-26 00:00:00, 2015-08-25 00:00:00, 2015-08-28 00:00:00, 2015-08-23 00:00:00]],
           labels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 7, 4, 11, 3, 8, 1, 12, 9, 13, 14, 15, 16, 17, 13, 18, 5, 11, 15, 6, 17, 1, 19, 7, 14, 20, 15, 17, 18, 3, 17, 18, 19, 21, 21, 12, 1, 12, 22, 20, 0, 23, 8, 5, 5, 21, 4, 23, 6, 23, 20, 9, 24, 23, 6, 22, 12, 4, 23, 19, 10, 25, 0, 17, 12, 10, 10, 22, 0], [0, 1, 2, 3, 0, 4, 5, 0, 6, 7, 7, 8, 7, 4, 5, 4, 9, 5, 4, 0, 4, 7, 9, 10, 9, 6, 7, 7, 1, 0, 4, 2, 7, 8, 9, 1, 4, 3, 10, 8, 9, 4, 10, 7, 5, 1, 2, 10, 5, 4, 9, 6, 6, 1, 5, 0, 8, 8, 0, 2, 5, 10, 2, 0, 8, 3, 8, 1, 9, 7, 1, 2, 10, 4, 0, 6, 9, 5, 7, 10]],
           names=[u'jim', u'joe'])

(Pdb) right.index
MultiIndex(levels=[[u'a', u'b', u'c', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'm', u'n', u'o', u'p', u'q', u'r', u's', u't', u'u', u'v', u'w', u'x', u'y', u'z'], [2015-08-23 00:00:00, 2015-08-24 00:00:00, 2015-08-25 00:00:00, 2015-08-26 00:00:00, 2015-08-27 00:00:00, 2015-08-28 00:00:00, 2015-08-29 00:00:00, 2015-08-30 00:00:00, 2015-08-31 00:00:00, 2015-09-01 00:00:00]],
           labels=[[3, -1, 2, 23, 17, 21, 24, 19, 12, 18, 1, 2, 19, 17, 11, 23, 12, -1, 13, 18, 5, 14, 8, 15, 0, 5, 20, 21, 11, 8, 24, 0, -1, 22, 19, 14, 9, 8, 0, 20, 23, 0, 20, 22, 4, 4, 13, -1, 13, 16, 9, 3, 6, 12, 21, 21, 4, 17, 6, 24, 6, 9, 18, 7, 6, 24, 16, 13, 17, 6, 22, 1, 10, 3, 0, 13, 1, 1, 16, 3], [1, 7, 9, -1, 1, 8, 4, 1, 6, 3, 3, 2, 3, 8, 4, 8, 5, 4, 8, 1, 8, 3, 5, 0, 5, 6, 3, 3, 7, 1, 8, 9, 3, 2, 5, 7, 8, -1, 0, 2, 5, 8, 0, 3, 4, 7, 9, 0, 4, 8, 5, 6, 6, 7, 4, 1, 2, 2, 1, 9, 4, 0, 9, 1, 2, -1, 2, 7, 5, 3, 7, 9, 0, 8, 1, 6, 5, 4, 3, 0]],
           names=[u'jim', u'joe'])

So left has explicit NaN values in its levels, while right does not, so we get:

(Pdb) left.index.levels[0].inferred_type
'mixed'

(Pdb) right.index.levels[0].inferred_type
'string'

@jreback Should I alter the test to make the multiindexes comparable? If yes, how? Or is there a problem with my code that I should fix instead?

jreback commented on Mar 7, 2017:

@nbonnotte having NaNs in a MultiIndex's levels is an anti-pattern. How did you construct this?
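
For reference, this is how pandas normally encodes a missing value in a MultiIndex: it is not stored in the levels at all, but shows up as a -1 code (the attribute was called labels at the time of this PR, codes in current pandas):

import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_arrays([["a", np.nan, "b"], [1, 2, 3]])

mi.levels[0]  # Index(['a', 'b'], dtype='object') -- NaN is not a level
mi.codes[0]   # array([ 0, -1,  1], dtype=int8) -- the missing entry is encoded as -1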

jreback removed this from the 0.20.0 milestone on Mar 23, 2017
jreback commented on Mar 23, 2017:

@nbonnotte how's this coming

nbonnotte (author):

I'm not doing much, really. I think it's just that the groupings' labels aren't supposed to be NaNs. I need to fix this, obviously

has2k1 added a commit to has2k1/plotnine that referenced this pull request Apr 25, 2017
Finally got rid of `geom._make_pinfos`. Had to create a wrapper
`groupby_with_null` around `DataFrame.groupby` to allow grouping
on columns with Null values. Almost at the same time a PR [1]
popped up to probably solve this issue.

---

[1] pandas-dev/pandas#12607
jreback commented on Jul 26, 2017:

good ideas, but needs rebase / update. pls comment if you wish to proceed.

jreback closed this on Jul 26, 2017
gfyoung added this to the No action milestone on Jul 26, 2017
nbonnotte (author):

Frankly, it's the number of tests to add or update that killed me.

TrigonaMinima:

@nbonnotte I'd like to take this up if you are not working on this. Can you confirm?

nbonnotte (author):

@TrigonaMinima No, I'm no longer working on it :)

WesIngwersen:

I see this as a major drawback of the otherwise powerful groupby functions in pandas. The first time I came upon it, the return of an empty DataFrame, without any error message, really threw me off. I hope the previous work can be finished.

Labels: Groupby · Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Projects: None yet
Development

Successfully merging this pull request may close these issues.

ENH: pivot/groupby index with nan
7 participants