Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: GroupBy.get_group raises ValueError when group key contains NaT #6996

Merged
merged 1 commit into from
May 30, 2015

Conversation

sinhrks
Copy link
Member

@sinhrks sinhrks commented Apr 28, 2014

Closes #6992. Made GroupBy.get_group works even if the key contains NaT.

NOTE: One issue is that GroupBy.groups returns incorrect key in numpy 1.6. This seems to be caused by _convert_grouper uses grouper.reindex(axis).values to return value. This looks doesn't affect to main functionalities, but is there any expected result?

import pandas as pd
import numpy as np
>>> np.__version__
1.6.2

>>> df = pd.DataFrame({'values': np.random.randn(8), 
                   'dt': [np.nan, pd.Timestamp('2013-01-01'), np.nan, pd.Timestamp('2013-02-01'),
                          np.nan, pd.Timestamp('2013-02-01'), np.nan, pd.Timestamp('2013-01-01')]})
>>> grouped = df.groupby('dt')

>>> grouped.groups
{1970-01-16 48:00:00: [1, 7], 1970-01-16 24:00:00: [3, 5]}

@@ -1809,7 +1810,13 @@ def _make_labels(self):
@property
def groups(self):
if self._groups is None:
self._groups = self.index.groupby(self.grouper)
if isinstance(self.grouper, DatetimeIndex):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think better to add an argument to Index.groupby and DatetimeIndex.groupby, maybe exclude_missing=False as the default (and you will pass True here), then you can leave groupby alone and do the exlusions inside the index?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I've changed the logic and would like to ask about keyword.

Considering #3729, I preferred to dropna as a keyword, and now it uses dropna. But looks better to use exclude_missing because pivot_table already has dropna for different purpose (to exclude index/columns from result which values are all NaN).

Is it OK to add exclude_missing to Dataframe.groupby and pivot_table as new keyword?

Also, I think default should be dropna=True based on current behavior which exclude NaT as nan.

@jreback jreback added this to the 0.14.1 milestone May 1, 2014
@jreback
Copy link
Contributor

jreback commented May 2, 2014

you can't have nan in the int groupby nor datetime

the cython routines are very strict about types

so u don't need to check for a non existsnt type

furthermore the datetimes won't be passed directly but as a i8 dtype

@sinhrks
Copy link
Member Author

sinhrks commented May 2, 2014

I understand nan should not be included in values of such indexes, but can be passed as to_groupby argument.

Existing cython logic doesn't care about index instance, but check nan for the argument. I didn't change the core logic, and added to Nat handling condition.

@jreback
Copy link
Contributor

jreback commented May 24, 2014

@sinhrks haven't really looked at this, but does this solve #7227 as well?

@sinhrks
Copy link
Member Author

sinhrks commented May 24, 2014

@jreback Unfortunately No. #7227 seems to be caused by TimeGrouper, and the fix is only for normal groupby. I'll take a deeper look later.

@@ -1924,7 +1925,7 @@ def groupby_indices(ndarray values):
k = labels[i]

# was NaN
if k == -1:
if k == -1 or k is tslib.NaT:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do something like

cdef Py_ssize_t NaT = tslib.NaT
# ...
k == NaT

no need for any overhead of attribute access.

@sinhrks
Copy link
Member Author

sinhrks commented Jun 5, 2014

In above example to group by datetime column, labels argument for groupby_template will be DatetimeIndex, not i8. So I understand all functions created by groupby_template should have the logic to exclude NaT

index = [0 1 2 3 4 5 6 7]
labels = <class 'pandas.tseries.index.DatetimeIndex'>
[NaT, ..., 2013-01-01]
Length: 8, Freq: None, Timezone: None

@jreback
Copy link
Contributor

jreback commented Jun 5, 2014

hmm...DatetimeIndex should never be passed to a cython function, ONLY i8. can you show me where that is the case? (you can pass as object dtype, but i8 is MUCH faster so always pass that way)

@sinhrks
Copy link
Member Author

sinhrks commented Jun 6, 2014

@jreback The behavior seems to be caused by Grouping.
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L1881-1884

grouper is converted to datetime in following lines.
https://github.com/pydata/pandas/blob/master/pandas/core/groupby.py#L1831-1838

@jreback
Copy link
Contributor

jreback commented Jun 6, 2014

@sinhrks that doesn't seem a problem, its when its actually passed to cython that needs conversion (to and from), can you investigate?

@jreback
Copy link
Contributor

jreback commented Jun 19, 2014

@sinhrks can you rebase and address comments?

@sinhrks
Copy link
Member Author

sinhrks commented Jun 20, 2014

@jreback You mean to convert to_groupby (group key) just before passed to cython func, correct?
https://github.com/pydata/pandas/blob/master/pandas/core/index.py#L1330-1331

Conversion looks possible if to_gorupby is datetime, but it also can be an object dtype.
Thus I think cython funcs should handle object in this case?

@jreback
Copy link
Contributor

jreback commented Jun 20, 2014

I think u should convert to i8 if possible (and reverse after) - evn though I guess it's handled as object now

@jreback
Copy link
Contributor

jreback commented Jun 24, 2014

#7560 does this conversion now (actually these are converted to float then coerced back).

So rebase after I merge that and let's take another look

@jreback
Copy link
Contributor

jreback commented Jun 26, 2014

can you show groupby benches after this change..thanks

@sinhrks
Copy link
Member Author

sinhrks commented Jun 26, 2014

Thanks, but #7560 doesn't convert group keys, correct? Thus, _algos.groupby_xxx functions are still receives objects as labels argument even after #7560.

I think dtype should be converted in Index.groupby before calling cython funcs. Is it ok to change Index.groupby only to accept ndarray, or is it categorized as public API?

@jreback
Copy link
Contributor

jreback commented Jun 26, 2014

Ahh, yes was thinking of something else.

Index.groupby is ONLY used by groupby IIRC, so its not really public (so ok to change).

@sinhrks
Copy link
Member Author

sinhrks commented Jun 26, 2014

OK, let me try. Maybe there is no side-effect, because currently _algos.groupby_xxx only accepts ndarray (thus the point which error occurrs being changed if something passed other than ndarray).

@sinhrks
Copy link
Member Author

sinhrks commented Jun 26, 2014

@jreback I made it work when group-key has datelike dtypes, is this what you mean?

Have to consider object dtype case and better way to coercing dtype back.

@jreback
Copy link
Contributor

jreback commented Jul 1, 2014

can you rebase and revisit this...thxs

@jreback
Copy link
Contributor

jreback commented Jul 6, 2014

pushing if you get 2 this in next day or 2 lmk

@jreback jreback modified the milestones: 0.15.0, 0.14.1, 0.15.1 Jul 6, 2014
@jreback
Copy link
Contributor

jreback commented May 9, 2015

@sinhrks let's do this for 0.17.0

@jreback jreback modified the milestones: 0.17.0, Next Major Release May 9, 2015
@sinhrks sinhrks force-pushed the groupnat branch 5 times, most recently from 1731852 to ca22e12 Compare May 15, 2015 17:49
@sinhrks
Copy link
Member Author

sinhrks commented May 15, 2015

Yes, fixed to work on the latest master.

@jreback
Copy link
Contributor

jreback commented May 15, 2015

@sinhrks sinhrks force-pushed the groupnat branch 2 times, most recently from f91a28c to fe8860e Compare May 15, 2015 21:56
@sinhrks
Copy link
Member Author

sinhrks commented May 15, 2015

Thanks, I haven't noticed that.

@@ -2010,7 +2012,7 @@ def groupby_indices(ndarray values):
k = labels[i]

# was NaN
if k == -1:
if k == -1 or k is NaT:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't look right. these are indices, not actual objects, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct, actually no need to change algos.pyx. Fixed.

@sinhrks sinhrks force-pushed the groupnat branch 3 times, most recently from c423fcd to 6e79756 Compare May 21, 2015 13:31
@sinhrks
Copy link
Member Author

sinhrks commented May 28, 2015

@jreback pls check if there is anything I should do?

@jreback
Copy link
Contributor

jreback commented May 28, 2015

lgtm....go ahead and merged

@sinhrks
Copy link
Member Author

sinhrks commented May 28, 2015

Thanks, and sorry I found the doc header alignment is incorrect. Will fix it then merge if nothing else.

sinhrks added a commit that referenced this pull request May 30, 2015
BUG: GroupBy.get_group raises ValueError when group key contains NaT
@sinhrks sinhrks merged commit 753d66b into pandas-dev:master May 30, 2015
@sinhrks sinhrks deleted the groupnat branch May 30, 2015 13:49
@jorisvandenbossche jorisvandenbossche modified the milestones: 0.17.0, 0.16.2 Jun 2, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Missing-data np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: Groupby NaT Handling
4 participants