groupby enumerate method #4646

hayd · 2013-08-23T00:05:27Z

I'm not sure what a good word for this is (count is taken, order means sort)!

But it's quite common to want a column which enumerates the items in each group / counts their occurrences.

You can hack it:

In [1]: df = pd.DataFrame([[1, 2], [2, 3], [1, 4], [1, 5], [2, 6]])

In [2]: g = df.groupby(0)

In [3]: g.apply(lambda x: pd.Series(np.arange(len(x)), x.index))
Out[3]:
0    0
1    0
2    1
3    2
4    1
dtype: int64

In [5]: df['order'] = _

In [6]: df
Out[6]:
   0  1  order
0  1  2      0
1  2  3      0
2  1  4      1
3  1  5      2
4  2  6      1
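
For comparison, the method this thread eventually settles on (cumcount, per the later comments) produces the same column without going through apply. A minimal sketch, assuming a pandas version that ships GroupBy.cumcount:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [2, 3], [1, 4], [1, 5], [2, 6]])

# Number the rows within each group of column 0, starting from 0.
df['order'] = df.groupby(0).cumcount()
print(df)
```

The result keeps the original row index, so it can be assigned straight back as a column.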

I've seen this in a few SO questions, here's just one.

cc @cpcloud (and I've seen @jreback answer a question with this)


jreback · 2013-08-23T00:13:24Z

counter() ?

hayd · 2013-08-23T00:33:44Z

I kind of favour enumerate() as it's more in line with python (and it reminds me of Demolition Man).

I was wrong: count isn't taken for groupby, it's size, but I'm -1 on same/similarly named methods with different meaning across groupby/DataFrame.

cpcloud · 2013-08-23T01:17:44Z

plus with enumerate maybe you could provide an iterator or column to cycle through..

hayd · 2013-08-23T12:14:46Z

@cpcloud Interesting idea, not sure how it'd work (if you pass in an iterator it'd be eaten by the first thing/start part way through?). But perhaps mapping after the enumeration would be the way to go (mod and apply a getitem)?

In [21]: df['order'].apply(list('abc').__getitem__)  # continued from above
Out[21]:
0    a
1    a
2    b
3    c
4    b
Name: order, dtype: object

In [22]: (df['order'] % 2).apply(list('ab').__getitem__)  # I guess this is what you mean by cycle
Out[22]:
0    a
1    a
2    b
3    a
4    b
Name: order, dtype: object
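
The "mod, then look up" idea above could also be written with Series.map instead of apply and __getitem__. This is only a sketch: the labels list is a made-up example, and the within-group numbering is computed with cumcount (the name adopted later in this thread) rather than the apply hack.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [2, 3], [1, 4], [1, 5], [2, 6]])
order = df.groupby(0).cumcount()

labels = ['a', 'b']  # hypothetical labels to cycle through

# Wrap the within-group position around the label list, then look each one up.
cycled = (order % len(labels)).map(dict(enumerate(labels)))
print(cycled)
```

Taking the modulus first is what makes this a cycle: positions past the end of the list wrap back to the first label instead of raising an IndexError.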

To implement is there a more efficient way? Or should it just be a shortcut:

g.apply(lambda x: pd.Series(np.arange(len(x)), x.index))

hayd · 2013-10-04T16:25:01Z

I wonder if enumerate could also mean enumerate by the group it's in (or does that not make sense?) maybe that could be a possible kwarg

jtratner · 2013-10-04T17:43:51Z

enumerate is something specific in Python (iterator with index), might want
to pick a different name.

As I read through I had this question: how is this different from
value_counts()?

cpcloud · 2013-10-04T17:44:47Z

it's enumerating the integer index of the elements in a group

cpcloud · 2013-10-04T17:45:05Z

sort of like a cumvalue_counts()

cpcloud · 2013-10-04T17:45:36Z

(which is a horrible name for this)

hayd · 2013-10-04T22:05:31Z

I think enumerate is the right name for this because it's something specific to python... I really like this name.

Was wondering if another potentially useful/expected thing for enumerate to do is enumerate the groups (I'm not sure), a little bit like this:

from itertools import count
c = count(-1)
df = pd.DataFrame([['a', 2], ['b', 3], ['a', 4], ['a', 5], ['b', 6]])

In [4]: df
Out[4]: 
   0  1
0  a  2
1  b  3
2  a  4
3  a  5
4  b  6

In [5]: g = df.groupby(0)

In [6]: g.apply(lambda x: pd.Series(next(c), x.index))
Out[6]: 
0    0
1    1
2    0
3    0
4    1
dtype: int64
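
For readers on a recent pandas: numbering the groups themselves (rather than the rows inside each group) is available as GroupBy.ngroup in later releases, so the itertools.count trick is no longer needed. A small sketch, assuming a pandas version that has both ngroup and cumcount:

```python
import pandas as pd

df = pd.DataFrame([['a', 2], ['b', 3], ['a', 4], ['a', 5], ['b', 6]])
g = df.groupby(0)

group_number = g.ngroup()    # which group each row belongs to
within_group = g.cumcount()  # position of each row inside its group

print(pd.DataFrame({'group': group_number, 'position': within_group}))
```

The two results answer the two readings of "enumerate" discussed here: one labels the group, the other labels the row within it.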

hayd · 2013-10-04T22:05:59Z

Although potentially the groups are not well ordered anyway....

hayd · 2013-10-04T22:24:24Z

(I think they are ordered, so this should be ok, I wonder if these two alternatives could be distinguished with a well chosen kwarg...)

ifmihai · 2013-11-12T18:11:07Z

My point of view,
as an average python/pandas user,

a name with the verb 'to count' is the most intuitive
because you count the apparitions
I take the first value in a Series, let's say value is 6,
and then I parse visually the Series counting the apparitions of 6

enumerate in python is not counting
it gets every element with its index
so it's not intuitive at all from my point of view
it's quite confusing actually

hayd · 2013-11-12T18:44:46Z

I take the point that it is somewhat confusing, but I think that any other option will be significantly more so.

This is enumerating each of the "apparitions". count is the wrong word since it usually has the meaning (elsewhere in pandas, e.g. DataFrame .count) of counting the total occurrences (i.e. groupby .size). We are not counting, we are enumerating.

> enumerate in python is not counting it gets every element with its index

If you replace index with occurrence, this is exactly what we are doing.

jreback · 2013-11-12T18:47:51Z

how about tally, or number ?

cpcloud · 2013-11-12T20:24:10Z

I like tally, -1 on number.

ifmihai · 2013-11-12T21:22:35Z

other names:
counter
Numerate
Apparencies
Count_apparencies
Count_occurences
Occurences
Rolling_count

That's all I can think of now

ifmihai · 2013-11-12T21:29:09Z

I checked English dictionary
Apparitions and apparencies do not make any sense in English, it seems

sorry, English is not my first language

jtratner · 2013-11-12T21:33:04Z

maybe "appearances" is close to what you were thinking?

hayd · 2013-11-12T21:37:04Z

-1 on count or counter (as mentioned above). Also -1 on rolling_count, since those functions have windows rather than being cumulative, and actually it does make sense to apply these.

I wonder if cumcount makes sense, inline with already existing cumsum etc. :s (though I dislike it mathematically)

I'm unsure about tally, it could be ok but I'm unsure if it's a bit of colloquialism (I think it's less clear than enumerate)...

TomAugspurger · 2013-11-12T22:11:09Z

I like tally. cumcount is good too.

EDIT: Actually, after thinking a bit more about cumcount I like it less. cumsum is always increasing. cumcount wouldn't be since the "current" cumulative count is jumping between groups...

cpcloud · 2013-11-12T22:29:11Z

@ifmihai I like apparitions 👻 😄

hayd · 2013-11-13T00:06:14Z

@TomAugspurger It's always increasing within each group, just like cumcount/enumerate.

cumsum is already available:

In [31]: df = pd.DataFrame([[1, 1], [2, 1], [1, 2]], columns=['A', 'B'])

In [32]: g = df.groupby('B')

In [33]: g.cumsum()
Out[33]: 
   A  B
0  1  1
1  3  2
2  1  2

In [34]: g.A.cumsum()
Out[34]: 
0    1
1    3
2    1
dtype: int64

(and is not increasing)
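
To make the "increasing within each group" point concrete, here is a hedged sketch that checks it directly. It selects column A explicitly before calling cumsum and uses cumcount under the name agreed later in the thread; the per_group_ok variable is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame([[1, 1], [2, 1], [1, 2]], columns=['A', 'B'])
g = df.groupby('B')

running_sum = g['A'].cumsum()  # restarts whenever the group changes
running_count = g.cumcount()   # likewise restarts per group

# Non-decreasing inside each group, even though the full column is not.
per_group_ok = running_count.groupby(df['B']).apply(
    lambda s: s.is_monotonic_increasing
).all()

print(running_sum.tolist(), running_count.tolist(), per_group_ok)
```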

hayd · 2013-11-13T00:09:24Z

Actually I think tally suffers from one of the issues that count etc. does: it usually means total (and not cumulative total).

If we were to go with cumcount etc we should make cumsum etc. a bit more visible...

TomAugspurger · 2013-11-13T00:27:39Z

Yep, I was wrong. Not sure what I was thinking. Should we split the difference and go with cumtally? Programming is hard.

ifmihai · 2013-11-13T07:59:51Z

@jtratner YES! appearances was the word I was searching for,
but it seems cumcount, enumerate, tally are preferred
(I put them in the order of my own preference)

From my perspective (of a foreigner) tally doesn't mean anything, especially in programming
cumcount appears the most logical up to this point, to be related to pandas also (or statistics)
but count also gives me the impression of a total, not cumulative

@cpcloud ha ha! anyway, apparitions are interesting :)

some other words that can be used (maybe):
track
reckon(?)
register(?)
mark
score
account

personally I prefer appearances or track_appearances

I guess it will not be a heavily used function, so I guess the name can be longer if needed, right?

jorisvandenbossche · 2013-11-13T09:23:41Z

-1 for tally/cumtally. As a non-native English speaking person, I had never heard of that word, and I asked my colleagues and I am not alone.

If I would describe the action, I would say I number the items within the group (give them a number), so maybe number or numbering (but number has also other nuances, so maybe not so clear).
For the rest, I think cumcount is clear and in the same line with cumsum etc, so +1. And I think I like 'occurrence' more than 'appearance'.

hayd · 2013-11-13T19:17:40Z

Let's go with cumcount then.

~~We should try and make these (cumsum etc) tab complete, for consistency...~~

jreback · 2013-11-13T19:23:47Z

+1 for cumcount

cpcloud · 2013-11-13T23:39:26Z

+1 here too

hayd · 2013-11-14T22:04:39Z

cumcount it is, now in master/0.13.

hayd · 2013-11-20T05:46:32Z

One annoying thing I've realised concerns as_index: if you pass a groupby which is as_index, should it include the group key in the result's index? Note filter doesn't (should it??), most other things do... or try to at least:

Observe:

In [9]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

In [10]: g = df.groupby('A')  # effectively as_index=True

In [11]: g.head(1)  # MI, of groupby + df.index
Out[11]: 
     A  B
A        
1 0  1  2
5 2  5  6

In [12]: g.cumcount()  # index of df
Out[12]: 
0    0
1    1
2    0
dtype: int64

thoughts?

ifmihai · 2013-11-20T15:30:05Z

So what's the question? :)
I'm not sure I follow, although I want to answer.

ps.
cumcount() will be available for pandas.Series also, right?

hayd · 2013-11-20T19:53:44Z

@ifmihai I think I'll turn this question into a more general one about as_index consistency. It's a little strange as nth does a different thing too.

Difference is when you look at the index of the above results, the head has A prepended to the index...

cumcount is not a Series method, what were you thinking it would do? sugar for s.groupby(s).cumcount() ?
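
The s.groupby(s).cumcount() spelling mentioned above can be tried directly. A minimal sketch with made-up values, assuming a pandas version that has cumcount:

```python
import pandas as pd

s = pd.Series([6, 3, 6, 6, 3])

# Group the Series by its own values and number each occurrence.
occurrence = s.groupby(s).cumcount()
print(occurrence)
```

Each value is numbered by how many times it has already appeared, which is the "counting the appearances" use case described earlier in the thread.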

jreback · 2013-11-20T20:39:52Z

@hayd when you have a chance, can you add a versionadded tag to the docs for this?

hayd · 2013-11-20T23:22:10Z

I was wondering if index should be:

In [27]: g.cumcount()
Out[27]: 
A   
1  0    0
   1    1
5  2    0
dtype: int64
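
If the MultiIndex layout above is wanted, it can be built by hand from the flat result. This is only a sketch of one way to do it, not a description of what pandas does internally; with_key is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
counts = df.groupby('A').cumcount()  # indexed like df

# Prepend the group key as an outer index level.
with_key = counts.copy()
with_key.index = pd.MultiIndex.from_arrays(
    [df['A'], df.index], names=['A', None]
)
print(with_key)
```

Keeping the flat index as the default has the practical advantage that the result aligns with df and can be assigned back as a column without any reindexing.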

This came up as I was trying to tweak nth, but got into a muddle with what that does to get its index.

ifmihai · 2013-11-21T11:28:12Z

@hayd I use a separate function now, like cumcount(), to count a Series, or a column in a df. I wasn't even thinking about df.groupby() up to this thread. Right now I don't see much use for it through groupby(), as I don't have use cases in mind.

Now back to the original question, with as_index,
as a vote,
the index seems more natural to include A,
but I'm not sure 100%

ps. I cannot work with 0.13 right now (I don't know how to play with separate environments)

ghost assigned hayd Oct 22, 2013

hayd mentioned this issue Nov 14, 2013

ENH add cumcount groupby method #5510

Merged

hayd closed this as completed in #5510 Nov 14, 2013

hayd mentioned this issue Nov 20, 2013

DOC version added 0.13 to cumcount #5559

Merged

jorisvandenbossche mentioned this issue Nov 29, 2013

cumsum sums the groupby column #5614

Closed

toobaz mentioned this issue Feb 9, 2015

Undocumented API changes in groupby.apply #9447

Closed
