Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

nth groupby method on DataFrame #5552

Closed
hayd opened this issue Nov 19, 2013 · 14 comments · Fixed by #6569
Closed

nth groupby method on DataFrame #5552

hayd opened this issue Nov 19, 2013 · 14 comments · Fixed by #6569

Comments

@hayd
Copy link
Contributor

hayd commented Nov 19, 2013

The nth groupby method on a Series takes the nth non-NaN value in the Series.

This means on a DataFrame it's not going to be well defined...

Should we make this a Series only method?

In [1]: g = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],columns=['A']).groupby('A')

In [2]: g.nth(0)
Out[2]: 
Empty DataFrame
Columns: []
Index: []

In [3]: g.A.nth(0)
Out[3]: 
A
a    a
b    b
Name: A, dtype: object
@hayd
Copy link
Contributor Author

hayd commented Nov 20, 2013

Maybe I just don't understand what nth does, not always empty.

@cancan101
Copy link
Contributor

@hayd Maybe nth on a DataFrame should take the column as an argument and then it would then solve #5503

@hayd
Copy link
Contributor Author

hayd commented Nov 20, 2013

It's also really slow: http://stackoverflow.com/a/20087789/1240268

I'm of the opinion that nth should take the nth, regardless of NaN, or take kwarg for NaN-ness. Need to think through the logic of what it is doing atm (it's only small!)

@hayd
Copy link
Contributor Author

hayd commented Nov 20, 2013

For convenience the current impl is:

def nth(self, n):
    def picker(arr):
        arr = arr[notnull(arr)]
        if len(arr) >= n + 1:
            return arr.iget(n)
        else:
            return np.nan
    return self.agg(picker)

With a frame the arr = arr[notnull(arr)] does nothing, I really don't understand how the next bit even works since iget is not a DataFrame method....!

I like the idea of a kwarg to describe this, grr to current behaviour: Series/DataFrame being inconsistent. Not sure how we should play this.

...in any case we should use cumcount, much faster.

@hayd
Copy link
Contributor Author

hayd commented Nov 20, 2013

Other possibility is plonk/override this into Series Groupby "as is*", but have it in Groupby the just take the nth row regardless of NaN, at least for now.

Also atm this is not that great, as it NaNs for groups smaller than n.

* but using cumcount.

@hayd
Copy link
Contributor Author

hayd commented Mar 5, 2014

Any thoughts on this @jreback @TomAugspurger?

Should we somehow depreciate old Series behaviour (of getting nth non-null result, I think this may be something that R does which is why we do it...)? Not sure on way forward.

I still don't understand how this even works on a DataFrame... cumcount impl much easier (confusing to be different to Series though it already is).

@jreback
Copy link
Contributor

jreback commented Mar 5, 2014

I think u should break the API and have a na='drop' (but default to na=None), meaning don't remove Nan's

@hayd
Copy link
Contributor Author

hayd commented Mar 5, 2014

Should we have something similar for DataFrames? Could be drop_na (think this is inline with other args), You think this arg should only be for Series or hack something which will work for Dataframes too?

@jreback
Copy link
Contributor

jreback commented Mar 5, 2014

I think for frames u could do any/all
equivalent of doing a dropna beforehand and just default o None ?

@hayd
Copy link
Contributor Author

hayd commented Mar 5, 2014

sure, I think that's not so bad. I will remember to use _selected_obj :)

@hayd
Copy link
Contributor Author

hayd commented Mar 6, 2014

I'm not sure on the API for this to the old behaviour (i.e. not just filtering), for one thing there are two NaNs:

  • NaNs because there are no more results in the group.
  • NaNs values from the actual DataFrame/Series.

For another these are two very different results, you only get the groupby's by info if you use the old method.

I still think should change... just not sure on API for it.

In [11]: s = pd.Series([1, 2, 3, 1, np.nan])

In [12]: s.groupby([1, 2, 3, 1, 2]).nth(1)  # the NaN has meaning here (it was missing in s)
Out[12]:
3     1
4   NaN
dtype: float64

s.groupby([1, 2, 3, 1, 2]).nth(1)  # old, has index from by, the NaN means group not that big (with dropped nas), observe that index is from by
1     1
2   NaN
3   NaN
dtype: float64

@jreback
Copy link
Contributor

jreback commented Mar 7, 2014

travis is now happy...so go ahead at your leisure

@hayd
Copy link
Contributor Author

hayd commented Mar 7, 2014

@jreback still unsure about the API, any thoughts. I guess the old behaviour is an R thing?

I guess it's annoying that these are two very different things:

a. filter frame and series to take just the nth rows. result is subframe/series.
b. what it's doing before (at least in the series case, for dataframe it currently doesn't work*), which is grab the nth non-null for each group or NaN if not there, and index but the groupby by. result is not sub frame/series.

* amusingly I think the iget call I was confused about raises but is caught so get NaN

@jreback
Copy link
Contributor

jreback commented Mar 7, 2014

I think you need to support both cases, but make default a)

df.groupby(...).nth(5,drop_na=False)

so drop_na=False is a), drop_na=True is b)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants