nth groupby method on DataFrame #5552

hayd · 2013-11-19T22:02:44Z

The nth groupby method on a Series takes the nth non-NaN value in the Series.

This means on a DataFrame it's not going to be well defined...

Should we make this a Series only method?

In [1]: g = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']],columns=['A']).groupby('A')

In [2]: g.nth(0)
Out[2]: 
Empty DataFrame
Columns: []
Index: []

In [3]: g.A.nth(0)
Out[3]: 
A
a    a
b    b
Name: A, dtype: object

The text was updated successfully, but these errors were encountered:

hayd · 2013-11-20T02:39:19Z

Maybe I just don't understand what nth does, not always empty.

cancan101 · 2013-11-20T02:49:12Z

@hayd Maybe nth on a DataFrame should take the column as an argument and then it would then solve #5503

hayd · 2013-11-20T05:13:35Z

It's also really slow: http://stackoverflow.com/a/20087789/1240268

I'm of the opinion that nth should take the nth, regardless of NaN, or take kwarg for NaN-ness. Need to think through the logic of what it is doing atm (it's only small!)

hayd · 2013-11-20T05:41:51Z

For convenience the current impl is:

def nth(self, n):
    def picker(arr):
        arr = arr[notnull(arr)]
        if len(arr) >= n + 1:
            return arr.iget(n)
        else:
            return np.nan
    return self.agg(picker)

With a frame the arr = arr[notnull(arr)] does nothing, I really don't understand how the next bit even works since iget is not a DataFrame method....!

I like the idea of a kwarg to describe this, grr to current behaviour: Series/DataFrame being inconsistent. Not sure how we should play this.

...in any case we should use cumcount, much faster.

hayd · 2013-11-20T05:57:06Z

Other possibility is plonk/override this into Series Groupby "as is*", but have it in Groupby the just take the nth row regardless of NaN, at least for now.

Also atm this is not that great, as it NaNs for groups smaller than n.

* but using cumcount.

hayd · 2014-03-05T18:58:23Z

Any thoughts on this @jreback @TomAugspurger?

Should we somehow depreciate old Series behaviour (of getting nth non-null result, I think this may be something that R does which is why we do it...)? Not sure on way forward.

I still don't understand how this even works on a DataFrame... cumcount impl much easier (confusing to be different to Series though it already is).

jreback · 2014-03-05T19:45:49Z

I think u should break the API and have a na='drop' (but default to na=None), meaning don't remove Nan's

hayd · 2014-03-05T19:50:25Z

Should we have something similar for DataFrames? Could be drop_na (think this is inline with other args), You think this arg should only be for Series or hack something which will work for Dataframes too?

jreback · 2014-03-05T19:52:10Z

I think for frames u could do any/all
equivalent of doing a dropna beforehand and just default o None ?

hayd · 2014-03-05T19:54:18Z

sure, I think that's not so bad. I will remember to use _selected_obj :)

hayd · 2014-03-06T03:23:04Z

I'm not sure on the API for this to the old behaviour (i.e. not just filtering), for one thing there are two NaNs:

NaNs because there are no more results in the group.
NaNs values from the actual DataFrame/Series.

For another these are two very different results, you only get the groupby's by info if you use the old method.

I still think should change... just not sure on API for it.

In [11]: s = pd.Series([1, 2, 3, 1, np.nan])

In [12]: s.groupby([1, 2, 3, 1, 2]).nth(1)  # the NaN has meaning here (it was missing in s)
Out[12]:
3     1
4   NaN
dtype: float64

s.groupby([1, 2, 3, 1, 2]).nth(1)  # old, has index from by, the NaN means group not that big (with dropped nas), observe that index is from by
1     1
2   NaN
3   NaN
dtype: float64

jreback · 2014-03-07T00:02:50Z

travis is now happy...so go ahead at your leisure

hayd · 2014-03-07T01:06:35Z

@jreback still unsure about the API, any thoughts. I guess the old behaviour is an R thing?

I guess it's annoying that these are two very different things:

a. filter frame and series to take just the nth rows. result is subframe/series.
b. what it's doing before (at least in the series case, for dataframe it currently doesn't work*), which is grab the nth non-null for each group or NaN if not there, and index but the groupby by. result is not sub frame/series.

* amusingly I think the iget call I was confused about raises but is caught so get NaN

jreback · 2014-03-07T14:09:28Z

I think you need to support both cases, but make default a)

df.groupby(...).nth(5,drop_na=False)

so drop_na=False is a), drop_na=True is b)

hayd mentioned this issue Nov 19, 2013

PERF faster head, tail and size groupby methods #5533

Merged

hayd mentioned this issue Jan 27, 2014

Consistency with groupby as_index #5755

Closed

8 tasks

hayd mentioned this issue Mar 7, 2014

ENH/BUG groupby nth now filters, works with DataFrames #6569

Merged

hayd closed this as completed in #6569 Mar 7, 2014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

nth groupby method on DataFrame #5552

nth groupby method on DataFrame #5552

hayd commented Nov 19, 2013

hayd commented Nov 20, 2013

cancan101 commented Nov 20, 2013

hayd commented Nov 20, 2013

hayd commented Nov 20, 2013

hayd commented Nov 20, 2013

hayd commented Mar 5, 2014

jreback commented Mar 5, 2014

hayd commented Mar 5, 2014

jreback commented Mar 5, 2014

hayd commented Mar 5, 2014

hayd commented Mar 6, 2014

jreback commented Mar 7, 2014

hayd commented Mar 7, 2014

jreback commented Mar 7, 2014

nth groupby method on DataFrame #5552

nth groupby method on DataFrame #5552

Comments

hayd commented Nov 19, 2013

hayd commented Nov 20, 2013

cancan101 commented Nov 20, 2013

hayd commented Nov 20, 2013

hayd commented Nov 20, 2013

hayd commented Nov 20, 2013

hayd commented Mar 5, 2014

jreback commented Mar 5, 2014

hayd commented Mar 5, 2014

jreback commented Mar 5, 2014

hayd commented Mar 5, 2014

hayd commented Mar 6, 2014

jreback commented Mar 7, 2014

hayd commented Mar 7, 2014

jreback commented Mar 7, 2014