DataFrame.groupby fails with MultiIndex containing pd.NaT #9236

Closed
stevenmanton opened this issue Jan 13, 2015 · 8 comments · Fixed by #25310
Labels
Bug, good first issue, Groupby, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), Needs Tests (unit test(s) needed to prevent regressions)

@stevenmanton

It seems that the groupby operation fails when the row index is a MultiIndex containing NaT values. For example, the following code fails (v0.15.2) with TypeError: 'numpy.ndarray' object is not callable:

midx = pd.MultiIndex(levels=[[pd.NaT, pd.datetime(2012,1,2), 
                     pd.datetime(2012,1,3)], ['a', 'b']],
                     labels=[[0, 1, 1, 2], [0, 0, 1, 0]], names=['date', None])
df = pd.Series(pd.np.random.rand(4), index=midx)
df.groupby(level=1)

However, it seems as though np.nan values are handled properly:

midx = pd.MultiIndex(levels=[[pd.np.nan, 10, 20], ['a', 'b']],
                     labels=[[0, 1, 1, 2], [0, 0, 1, 0]], names=['date', None])
df = pd.Series(pd.np.random.rand(4), index=midx)
df.groupby(level=1)
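
For readers on a recent pandas release: the `labels` keyword has since been deprecated in favour of `codes`, and later releases dropped the `pd.datetime` and `pd.np` aliases, so the two snippets above will not run unchanged there. A rough equivalent is sketched below. It is not part of the original report, and the names `nat_ser`, `nan_ser`, `nat_counts` and `nan_counts` are illustrative only. On a version that includes the fix, both groupbys should succeed.

```python
from datetime import datetime

import numpy as np
import pandas as pd

# `codes` replaces the deprecated `labels` keyword; datetime and numpy are
# imported directly instead of going through pd.datetime / pd.np.
nat_midx = pd.MultiIndex(
    levels=[[pd.NaT, datetime(2012, 1, 2), datetime(2012, 1, 3)], ["a", "b"]],
    codes=[[0, 1, 1, 2], [0, 0, 1, 0]],
    names=["date", None],
)
nat_ser = pd.Series(np.random.rand(4), index=nat_midx)

# Same shape of index, but with np.nan as the missing value in level 0.
nan_midx = pd.MultiIndex(
    levels=[[np.nan, 10, 20], ["a", "b"]],
    codes=[[0, 1, 1, 2], [0, 0, 1, 0]],
    names=["date", None],
)
nan_ser = pd.Series(np.random.rand(4), index=nan_midx)

# Group on the second level in both cases; count() is used only to force
# the groupby to be evaluated.
nat_counts = nat_ser.groupby(level=1).count()
nan_counts = nan_ser.groupby(level=1).count()
print(nat_counts)
print(nan_counts)
```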
@shoyer added the Groupby, Missing-data, and Bug labels on Jan 13, 2015
@shoyer added this to the 0.16.0 milestone on Jan 13, 2015
@shoyer
Member

shoyer commented Jan 13, 2015

I can reproduce this on master.

Thanks for the report!

@jreback
Contributor

jreback commented Jan 13, 2015

IIRC this is a duplicate issue, if someone would like to find the reference.

@jreback
Contributor

jreback commented Jan 13, 2015

This is covered by #6996 and #6992; will xref it there.

Pull requests are welcome.

@jreback
Contributor

jreback commented Jan 13, 2015

actually, will reopen in case it is slightly different.

@jreback reopened this on Jan 13, 2015
@stevenmanton
Author

Thanks for looking into this! I've been banging into this all day as I've been working on some analysis. I took a look at the pandas source, but it's not clear to me where the bug is and how to go about fixing it. Nonetheless, I've found a pretty quick workaround that produces the behavior I would expect. Maybe this will help others with a similar problem or give some direction in fixing the issue. Essentially, the workaround drops the NaT value within the level.

Here's an example of the workaround that works for me:

midx = pd.MultiIndex(levels=[[pd.NaT, pd.datetime(2012,1,2), 
                     pd.datetime(2012,1,3)], ['a', 'b']],
                     labels=[[0, 1, 1, 2], [0, 0, 1, 0]], names=['date', None])
df = pd.Series(pd.np.random.rand(4), index=midx)
df.groupby(df.index.get_level_values(0)).count()
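
To make the effect of the workaround concrete, here is a minimal sketch using fixed values instead of random ones. It assumes a recent pandas release and builds the index with `MultiIndex.from_tuples` for brevity, which is not how the report constructs it. The names `ser`, `by_date` and `by_letter` are illustrative. Note that the workaround above groups by the date level; the last line applies the same idea to the second level, which is the grouping in the original report.

```python
from datetime import datetime

import pandas as pd

# Same four rows as in the report, with fixed values so the result is easy to read.
midx = pd.MultiIndex.from_tuples(
    [
        (pd.NaT, "a"),
        (datetime(2012, 1, 2), "a"),
        (datetime(2012, 1, 2), "b"),
        (datetime(2012, 1, 3), "a"),
    ],
    names=["date", None],
)
ser = pd.Series([1.0, 2.0, 3.0, 4.0], index=midx)

# Grouping by the level-0 values: rows whose key is NaT are left out by
# default, so only the two real dates appear in the result.
by_date = ser.groupby(midx.get_level_values(0)).count()
print(by_date)

# The same approach applied to level 1 ('a' / 'b').
by_letter = ser.groupby(midx.get_level_values(1)).sum()
print(by_letter)
```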

@jreback changed the milestone from 0.16.0 to Next Major Release on Mar 6, 2015
@mroeschke
Member

Looks to be fixed on master. I imagine this edge case could use a test.

In [4]: pd.__version__
Out[4]: '0.25.0.dev0+85.g0eddba883'

In [9]: midx = pd.MultiIndex(levels=[[pd.NaT, pd.datetime(2012,1,2),
   ...:                      pd.datetime(2012,1,3)], ['a', 'b']],
   ...:                      labels=[[0, 1, 1, 2], [0, 0, 1, 0]], names=['date', None])
   ...: df = pd.Series(pd.np.random.rand(4), index=midx)
   ...: df.groupby(level=1).mean()
/anaconda3/envs/pandas-dev/bin/ipython:3: FutureWarning: the 'labels' keyword is deprecated, use 'codes' instead
  # -*- coding: utf-8 -*-
Out[9]:
a    0.849207
b    0.877276
dtype: float64

@mroeschke added the good first issue and Needs Tests labels and removed the Duplicate Report label on Feb 8, 2019
@TrigonaMinima

@mroeschke
I tried the code on pandas 0.24.0 and it ran successfully. Could you point me to the file where this test should be added?

@mroeschke
Member

pandas/tests/groupby/test_groupby.py
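
A regression test for this edge case might look roughly like the sketch below. This is only an illustration of the idea: the test name and the fixed values are assumptions, and it uses the public `pandas.testing` helpers rather than whatever internal helpers the test suite prefers.

```python
from datetime import datetime

import pandas as pd
import pandas.testing as tm


def test_groupby_multiindex_nat():
    # GH 9236: grouping on one level should work even when another
    # level of the MultiIndex contains NaT.
    values = [
        (pd.NaT, "a"),
        (datetime(2012, 1, 2), "a"),
        (datetime(2012, 1, 2), "b"),
        (datetime(2012, 1, 3), "a"),
    ]
    mi = pd.MultiIndex.from_tuples(values, names=["date", None])
    ser = pd.Series([3, 2, 2.5, 4], index=mi)

    result = ser.groupby(level=1).mean()

    # 'a' averages 3, 2 and 4; 'b' has the single value 2.5.
    expected = pd.Series([3.0, 2.5], index=["a", "b"])
    tm.assert_series_equal(result, expected)
```

Fixed values are used instead of `rand(4)` so that the expected result can be written down exactly.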
