
WIP: ENH: pivot/groupby index with nan #12607

Closed · wants to merge 2 commits

Conversation

nbonnotte (Contributor) commented on Mar 13, 2016

I'm working on a solution for issue #3729.

I've identified where the changes need to go and put a first, crude solution in place. I still need to add more tests.

xref #9941, #5456, #6992, #443
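
For context, here is a minimal illustration of the behaviour this PR targets: rows whose group key is NaN are currently dropped without warning. The dropna keyword used in the second call is the option proposed in this PR (released pandas versions only picked this feature up later), so treat it as a sketch of the intended API:

import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["a", np.nan, "b", "a"], "val": [1, 2, 3, 4]})

df.groupby("key")["val"].sum()
# key
# a    5
# b    3
# Name: val, dtype: int64   <- the row with the NaN key is silently dropped

# proposed: keep NaN as its own group
df.groupby("key", dropna=False)["val"].sum()
# key
# a      5
# b      3
# NaN    2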


2016-08-28: here are the tests I think need to be added / copied / adapted (a rough sketch of one such test follows the lists below):

Basic tests:

  • Create a new test class, in a new file
  • test_basic_with_nan with dtypes in object, datetimes, timedelta & datetime64 with tz
  • test_groupby_series_dropna
  • test_groupby_frame_dropna
  • test_groupby_panel_dropna
  • test_groupby_multi_dropna

Tests with data types:

  • integer, floats
  • datetime, timedelta
  • strings and other objects
  • None

Existing tests to copy or adapt from TestGroupBy:

  • test_first_last_nth
  • test_nth
  • test_nth_multi_index_as_expected
  • test_nth_multi_index
  • test_group_selection_cache
  • test_basic
  • test_with_na
  • test_series_groupby_nunique
  • test_groupby_level_with_nas
  • test_groupby_nat_exclude
  • test_groupby_groups_datetimeindex
  • test_nlargest
  • test_nsmallest
  • test_groupby_as_index_agg
  • test_agg_api
  • test_transform
  • test_builtins_apply
  • test_groupby_whitelist
  • test__cython_agg_general
  • test_ops_general
  • test_groupby_transform_with_nan_group
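
As mentioned above, here is a rough sketch of what one of these tests (test_groupby_series_dropna) could look like, assuming the dropna keyword proposed in this PR; the data and expected values are illustrative only:

import numpy as np
import pandas as pd
import pandas.util.testing as tm  # pandas.testing in newer releases


def test_groupby_series_dropna():
    # GH 3729: a NaN grouping key should form its own group when dropna=False
    s = pd.Series([1.0, 2.0, 3.0, 4.0])
    keys = pd.Series(["a", np.nan, "b", "a"])

    result = s.groupby(keys, dropna=False).sum()
    expected = pd.Series([5.0, 3.0, 2.0], index=pd.Index(["a", "b", np.nan]))

    tm.assert_series_equal(result, expected)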

I will also need to edit the documentation, for instance this part:

@@ -166,6 +167,8 @@ def factorize(values, sort=False, order=None, na_sentinel=-1, size_hint=None):
na_sentinel : int, default -1
Value to mark "not found"
size_hint : hint to the hashtable sizer
dropna : boolean, default True
Review comment (Contributor):

add version added directive

nbonnotte (author):

done!
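
For reference, a minimal sketch of what factorize does with missing values today: they are marked with the na_sentinel rather than appearing among the uniques. The dropna flag documented in the hunk above is meant to control this behaviour:

import numpy as np
import pandas as pd

codes, uniques = pd.factorize(np.array(["a", np.nan, "b", "a"], dtype=object))
codes    # array([ 0, -1,  1,  0]) -- NaN is mapped to the na_sentinel (-1)
uniques  # array(['a', 'b'], dtype=object) -- NaN never becomes a unique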

jreback added the Groupby, Missing-data, and Reshaping labels on Mar 13, 2016
nbonnotte (author):

Tests I still need to add:

  • datetimes with NaT
  • panel

jreback (Contributor) commented on Mar 22, 2016:

@nbonnotte looking good. I think this needs quite a few more tests. This is actually a major change and needs some thorough testing. Essentially need to replicate lots of tests, just now with a NA group.

jreback commented on May 7, 2016:

can you rebase / update

codecov-io commented on Jun 1, 2016:

Codecov Report

Merging #12607 into master will increase coverage by 0.04%.
The diff coverage is 100%.


@@            Coverage Diff             @@
##           master   #12607      +/-   ##
==========================================
+ Coverage   90.99%   91.04%   +0.04%     
==========================================
  Files         153      137      -16     
  Lines       50469    49312    -1157     
==========================================
- Hits        45926    44894    -1032     
+ Misses       4543     4418     -125
Flag       Coverage Δ
#multiple  ?
#single    ?

Impacted Files              Coverage Δ
pandas/core/generic.py 96.25% <ø> (-0.04%) ⬇️
pandas/core/groupby.py 95% <100%> (+2.97%) ⬆️
pandas/core/algorithms.py 94.48% <100%> (+0.02%) ⬆️
pandas/tools/plotting.py 71.79% <0%> (-28.21%) ⬇️
pandas/io/pickle.py 74.28% <0%> (-5.26%) ⬇️
pandas/conftest.py 91.66% <0%> (-3.79%) ⬇️
pandas/util/_tester.py 35.29% <0%> (-3.6%) ⬇️
pandas/util/decorators.py 62.26% <0%> (-3.05%) ⬇️
pandas/tools/merge.py 91.78% <0%> (-1.91%) ⬇️
pandas/__init__.py 91.66% <0%> (-1.52%) ⬇️
... and 152 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1a117fc...f7a4c60.

jreback added this to the 0.19.0 milestone on Jun 4, 2016
nbonnotte (author) commented on Jun 19, 2016:

@jreback Well, this is a lot trickier than I thought. I'll need to touch some Cython functions.

For instance, in the last() aggregation method, some NaN values are silently skipped. If the last two rows of a group are

[[1,   2]
 [NaN, 3]]

we obtain [1, 3], because the NaN value is not considered.
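
A minimal reproduction of the behaviour described above (this is existing GroupBy.last behaviour, independent of this PR):

import numpy as np
import pandas as pd

df = pd.DataFrame({"key": ["g", "g"], "x": [1.0, np.nan], "y": [2, 3]})

df.groupby("key").last()
#        x  y
# key
# g    1.0  3   <- the NaN in column x is skipped, so the "last" row comes out as [1, 3]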

The thing is, I'm now sailing into what is for me uncharted territory.

Do you remember why it is necessary to check for NaN values?

EDIT: this is in fact a completely separate issue (#8427), which has nothing to do with the matter at hand. I stumbled on it while adapting the existing tests.

jorisvandenbossche modified the milestones: 0.19.0, 0.20.0 on Jul 8, 2016
jreback commented on Jul 15, 2016:

can you rebase and i'll have a look

jreback modified the milestones: 0.20.0, 0.19.0 on Jul 20, 2016
nbonnotte (author):

@jreback Done it

nbonnotte (author):

I've written a todo list with all the tests I think I might need to add / copy / adapt. I've put it in the first message; it'll be easier to find there, I think.

If I'm missing something, please let me know.

nbonnotte (author):

The tests fail on Windows because of the check on dtype:

FAIL: test_groupby_series_dropna (pandas.tests.test_groupby.TestGroupBy)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\tests\test_groupby.py", line 794, in test_groupby_series_dropna
    assert_series_equal(result, expected)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1154, in assert_series_equal
    assert_attr_equal('dtype', left, right)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 878, in assert_attr_equal
    left_attr, right_attr)
  File "C:\Python27_64\envs\pandas\lib\site-packages\pandas\util\testing.py", line 1018, in raise_assert_detail
    raise AssertionError(msg)
AssertionError: Attributes are different
Attribute "dtype" are different
[left]:  int64
[right]: int32

In the test, I had forced dtype=int, which apparently means int32 on 64-bit Windows.

I'm going to use assert_stuff_equal with check_dtype=False.

jreback commented on Oct 2, 2016:

the changes for the Windows dtype comparison might signify an underlying problem
don't change the comparisons unless the expected is being constructed with an ndarray
and pls note these tests

nbonnotte (author):

@jreback I've removed the check_dtype=False, and also the dtype=int I had put in (I don't even remember why), which created the problem in the first place.

jreback commented on Dec 6, 2016:

so need to expand the testing to include object (e.g. strings), and datetimelike (datetimes, timedelta, period) with nans.

I think the easiest is to create a new class in the test module, e.g. TestGroupbyDropna, and have the appropriate tests there (named the same as in TestGroupby).

Alternatively, you can move sections of tests to separate testing classes, e.g.

TestGroupbyNth

TestGroupbyAgg

TestGroupbyTransform

or, even better, reorganize like other test modules:

tests/groupby/nth.py, agg.py, transform.py, basic.py

etc.

So: you could move the tests first (to a sub-directory), or put them in separate classes for now.
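
A rough sketch of how the dtype coverage could look in such a dedicated module (everything here is illustrative; the dropna keyword is the one proposed in this PR, and the exact file/class layout is whatever gets chosen above):

import numpy as np
import pandas as pd
import pytest


@pytest.mark.parametrize(
    "keys",
    [
        ["a", np.nan, "b", "a"],                                           # object / strings
        pd.to_datetime(["2016-01-01", None, "2016-01-02", "2016-01-01"]),  # datetime64 with NaT
        pd.to_timedelta(["1 day", None, "2 days", "1 day"]),               # timedelta64 with NaT
        pd.PeriodIndex(["2016-01", None, "2016-02", "2016-01"], freq="M"), # period with NaT
    ],
)
def test_groupby_dropna_false_keeps_missing_group(keys):
    df = pd.DataFrame({"key": keys, "val": [1, 2, 3, 4]})
    result = df.groupby("key", dropna=False)["val"].sum()

    # the missing key (NaN/NaT) must show up as its own group
    assert result.index.isna().any()
    assert result[result.index.isna()].iloc[0] == 2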

jreback commented on Dec 30, 2016:

can you rebase and we'll see where this is

nbonnotte (author):

I've rebased.

Why does Appveyor say I've cancelled the tests?

jreback commented on Jan 17, 2017:

thanks @nbonnotte

appveyor is giving some trouble; they just fixed it, so I'm requeueing some things.

dropna : boolean, default True
Drop NaN values

.. versionadded:: 0.19.0
Review comment (Contributor):

0.20.0


# the following line fills uniques:
Review comment (Contributor):

remove the comment (I generally like comments, but a more general comment or nothing is warranted here)

nbonnotte (author):

I added a comment because it took me a very long time and some digging in the pyx files to understand what was going on. I am not expert enough to write a more general comment, but I would have liked to flag this for someone with the same level of pandas knowledge as me, to spare them the difficulties I had.

What should a more general comment look like?

@@ -4046,6 +4046,10 @@ def groupby(self, by=None, axis=0, level=None, as_index=True, sort=True,
squeeze : boolean, default False
reduce the dimensionality of the return type if possible,
otherwise return a consistent type
dropna : boolean, default True
drop NaN in the grouping values

Review comment (Contributor):

0.20.0

nbonnotte (author):

done

dropna : boolean, default True
drop NaN in the grouping values

.. versionadded:: 0.19.0
Review comment (Contributor):

same

@@ -2190,6 +2196,7 @@ class Grouping(object):
level :
in_axis : if the Grouping is a column in self.obj and hence among
Groupby.exclusions list
dropna

Review comment (Contributor):

add a comment

@@ -120,6 +126,52 @@ def checkit(dtype):
for dtype in ['int64', 'int32', 'float64', 'float32']:
checkit(dtype)

def test_basic_with_nan(self):

# GH 3729
Review comment (Contributor):

comment here

# GH 3729

def checkit(dtype):
data = Series(np.arange(9) // 3, index=np.arange(9), dtype=dtype)
Review comment (Contributor):

don't specify an index unless it is used explicitly (e.g. here it's just a range, which is the default)

# corner cases
self.assertRaises(Exception, grouped.aggregate, lambda x: x * 2)

for dtype in ['int64', 'int32', 'float64', 'float32']:
Review comment (Contributor):

add object, datetimes, timedelta & datetime64 with tz

def checkit(dtype):
data = Series(np.arange(9) // 3, index=np.arange(9), dtype=dtype)

index = np.arange(9)
Review comment (Contributor):

move all tests to a new class, maybe TestNanGrouping (you can even put it in a new file).

We need quite comprehensive testing and it will require several tests (over multiple dtypes)

@@ -1304,33 +1304,6 @@ def test_transform_coercion(self):
result = g.transform(lambda x: np.mean(x))
assert_frame_equal(result, expected)

def test_with_na(self):
nbonnotte (author):

Moved to the new file test_groupby_nan.py


from numpy import nan

from pandas.core.index import Index, MultiIndex
Review comment (Contributor):

just from pandas import Index, Series ......


result = df.groupby(df.b, dropna=False)['a'].transform(max)
expected = pd.Series([1, 1, 2, 3, 4, 6, 6, 9, 9, 9],
name='a', dtype=np.integer)
nbonnotte (author):

@jreback is this (forcing dtype=np.integer) an admissible way to deal with Windows' int32/int64 issues? Without it, the test fails because result is int32 and expected is int64... which is maybe something I should fix instead of tampering with the tests, but I have no idea how to do that. In this pull request, I never actually touch the data...

Review comment (Contributor):

@nbonnotte the issue here is that you should simply construct the test frames in a way that is not platform dependent, IOW use lists, or np.array with a specific dtype. range and unspecified dtypes will yield int32 on Windows (and int64 on other platforms).

IOW

do

df = DataFrame({'A': np.arange(10, dtype='int64')})

from pandas.core.index import Index, MultiIndex
from pandas.core.api import DataFrame
from pandas.core.series import Series
from pandas.util.testing import assert_frame_equal, assert_series_equal
Review comment (Contributor):

from pandas.util import testing as tm

Review comment (Contributor):

you do that below, so don't import these; just use them directly, e.g. tm.assert_frame_equal

from .common import MixIn


class TestGroupBy(MixIn, tm.TestCase):
Review comment (Contributor):

since we are writing new tests, pls use pytest features, including parametrization (IOW don't use a class nor self.assert*; instead use plain assert, and of course tm.assert_series_equal and such)


def setUp(self):
MixIn.setUp(self)
self.df_nan = DataFrame(
Review comment (Contributor):

e.g. this becomes a fixture
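
For illustration, roughly how the setUp data could become a pytest fixture (df_nan and its contents here are placeholders, not the PR's actual frame; the dropna keyword is the one this PR proposes):

import numpy as np
import pandas as pd
import pytest


@pytest.fixture
def df_nan():
    # placeholder frame with a NaN in the grouping column
    return pd.DataFrame({"key": ["a", np.nan, "b", "a"],
                         "val": [1.0, 2.0, 3.0, 4.0]})


def test_nan_group_is_kept(df_nan):
    result = df_nan.groupby("key", dropna=False)["val"].count()
    assert result[np.nan] == 1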

# corner cases
self.assertRaises(Exception, grouped.aggregate, lambda x: x * 2)

for dtype in ['int64', 'int32', 'float64', 'float32']:
Review comment (Contributor):

you parametrize this and simply write a function

name='a')
assert_series_equal(result, expected)

def test__cython_agg_general_nan(self):
Review comment (Contributor):

1 _ (i.e. use a single underscore in the test name)

nbonnotte (author) commented on Mar 5, 2017:

There is a problem when there is None in the grouping values. As it is, None is treated differently from NaN, and then there is an issue during sorting.

I guess the correct solution would be to convert all None to NaN... Let's see how I can do that

jreback commented on Mar 5, 2017:

@nbonnotte we already do inference for things like this on Grouping, for datetimelike (e.g. things are set as NaT). So you will need to replace None with np.nan in object series. But this should be a very minimal type of check.
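
A minimal sketch of the kind of normalization meant here (illustrative only; the real change would live in the Grouping code, not in a test or user code):

import numpy as np
import pandas as pd

# object-dtype grouping values may mix None and np.nan
values = np.array(["a", None, "b", np.nan], dtype=object)

# normalize: treat every missing marker as np.nan so grouping and sorting see a single value
values[pd.isna(values)] = np.nan

values  # array(['a', nan, 'b', nan], dtype=object)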

nbonnotte (author):

I have a problem with my MultiIndexes:

(Pdb) left.index
MultiIndex(levels=[[u'e', nan, u'c', u'y', u's', u'w', u'z', u'u', u'n', u't', u'b', u'm', u'o', u'g', u'p', u'j', u'q', u'a', u'v', u'x', u'k', u'f', u'r', u'h', u'i', u'l'], [2015-08-24 00:00:00, 2015-08-30 00:00:00, 2015-09-01 00:00:00, NaT, 2015-08-31 00:00:00, 2015-08-27 00:00:00, 2015-08-29 00:00:00, 2015-08-26 00:00:00, 2015-08-25 00:00:00, 2015-08-28 00:00:00, 2015-08-23 00:00:00]],
           labels=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 2, 7, 4, 11, 3, 8, 1, 12, 9, 13, 14, 15, 16, 17, 13, 18, 5, 11, 15, 6, 17, 1, 19, 7, 14, 20, 15, 17, 18, 3, 17, 18, 19, 21, 21, 12, 1, 12, 22, 20, 0, 23, 8, 5, 5, 21, 4, 23, 6, 23, 20, 9, 24, 23, 6, 22, 12, 4, 23, 19, 10, 25, 0, 17, 12, 10, 10, 22, 0], [0, 1, 2, 3, 0, 4, 5, 0, 6, 7, 7, 8, 7, 4, 5, 4, 9, 5, 4, 0, 4, 7, 9, 10, 9, 6, 7, 7, 1, 0, 4, 2, 7, 8, 9, 1, 4, 3, 10, 8, 9, 4, 10, 7, 5, 1, 2, 10, 5, 4, 9, 6, 6, 1, 5, 0, 8, 8, 0, 2, 5, 10, 2, 0, 8, 3, 8, 1, 9, 7, 1, 2, 10, 4, 0, 6, 9, 5, 7, 10]],
           names=[u'jim', u'joe'])

(Pdb) right.index
MultiIndex(levels=[[u'a', u'b', u'c', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'm', u'n', u'o', u'p', u'q', u'r', u's', u't', u'u', u'v', u'w', u'x', u'y', u'z'], [2015-08-23 00:00:00, 2015-08-24 00:00:00, 2015-08-25 00:00:00, 2015-08-26 00:00:00, 2015-08-27 00:00:00, 2015-08-28 00:00:00, 2015-08-29 00:00:00, 2015-08-30 00:00:00, 2015-08-31 00:00:00, 2015-09-01 00:00:00]],
           labels=[[3, -1, 2, 23, 17, 21, 24, 19, 12, 18, 1, 2, 19, 17, 11, 23, 12, -1, 13, 18, 5, 14, 8, 15, 0, 5, 20, 21, 11, 8, 24, 0, -1, 22, 19, 14, 9, 8, 0, 20, 23, 0, 20, 22, 4, 4, 13, -1, 13, 16, 9, 3, 6, 12, 21, 21, 4, 17, 6, 24, 6, 9, 18, 7, 6, 24, 16, 13, 17, 6, 22, 1, 10, 3, 0, 13, 1, 1, 16, 3], [1, 7, 9, -1, 1, 8, 4, 1, 6, 3, 3, 2, 3, 8, 4, 8, 5, 4, 8, 1, 8, 3, 5, 0, 5, 6, 3, 3, 7, 1, 8, 9, 3, 2, 5, 7, 8, -1, 0, 2, 5, 8, 0, 3, 4, 7, 9, 0, 4, 8, 5, 6, 6, 7, 4, 1, 2, 2, 1, 9, 4, 0, 9, 1, 2, -1, 2, 7, 5, 3, 7, 9, 0, 8, 1, 6, 5, 4, 3, 0]],
           names=[u'jim', u'joe'])

So left has explicit NaN values in its levels, while right does not, so we get:

(Pdb) left.index.levels[0].inferred_type
'mixed'

(Pdb) right.index.levels[0].inferred_type
'string'

@jreback Should I alter the test to make the multiindexes comparable? If yes, how? Or is there a problem with my code that I should fix instead?

jreback commented on Mar 7, 2017:

@nbonnotte having NaNs in a MultiIndex's levels is an anti-pattern. How did you construct this?
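
For reference, this is how pandas normally encodes a missing value in a MultiIndex: it is not stored in the levels at all, but shows up as a -1 code (the attribute was called labels at the time of this PR, codes in current pandas):

import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_arrays([["a", np.nan, "b"], [1, 2, 3]])

mi.levels[0]  # Index(['a', 'b'], dtype='object') -- NaN is not a level
mi.codes[0]   # array([ 0, -1,  1], dtype=int8) -- the missing entry is encoded as -1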

jreback removed this from the 0.20.0 milestone on Mar 23, 2017
jreback commented on Mar 23, 2017:

@nbonnotte how's this coming

nbonnotte (author):

I'm not doing much, really. I think it's just that the groupings' labels aren't supposed to be NaNs. I need to fix this, obviously

has2k1 added a commit to has2k1/plotnine that referenced this pull request Apr 25, 2017
Finally got rid of `geom._make_pinfos`. Had to create a wrapper
`groupby_with_null` around `DataFrame.groupby` to allow grouping
on columns with Null values. Almost at the same time a PR [1]
popped up to probably solve this issue.

---

[1] pandas-dev/pandas#12607
jreback commented on Jul 26, 2017:

good ideas, but needs rebase / update. pls comment if you wish to proceed.

jreback closed this on Jul 26, 2017
gfyoung added this to the No action milestone on Jul 26, 2017
nbonnotte (author):

Frankly, it's the number of tests to add or update that killed me.

TrigonaMinima:

@nbonnotte I'd like to take this up if you are not working on this. Can you confirm?

nbonnotte (author):

@TrigonaMinima No, I'm no longer working on it :)

WesIngwersen:

I see this as a major drawback of the otherwise powerful groupby functions in pandas. The first time I came upon it, the return of an empty DataFrame, without any error message, really threw me off. I hope the previous work can be finished.

Labels: Groupby · Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) · Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)
Projects: None yet
Development

Successfully merging this pull request may close these issues.

ENH: pivot/groupby index with nan
7 participants