groupby enumerate method #4646

Closed
hayd opened this issue Aug 23, 2013 · 37 comments · Fixed by #5510

Comments

@hayd
Contributor

hayd commented Aug 23, 2013

I'm not sure what a good word for this is (count is taken, order means sort)!

But creating a column which enumerates the items in each group / counts their occurrences is quite a common need.

You can hack it:

In [1]: df = pd.DataFrame([[1, 2], [2, 3], [1, 4], [1, 5], [2, 6]])

In [2]: g = df.groupby(0)

In [3]: g.apply(lambda x: pd.Series(np.arange(len(x)), x.index))
Out[3]:
0    0
1    0
2    1
3    2
4    1
dtype: int64

In [5]: df['order'] = _

In [6]: df
Out[6]:
   0  1  order
0  1  2      0
1  2  3      0
2  1  4      1
3  1  5      2
4  2  6      1

I've seen this in a few SO questions, here's just one.

cc @cpcloud (and I've seen @jreback answer a question with this)
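For anyone landing here later: the thread below settles on the name cumcount. A minimal sketch, assuming a pandas version that ships GroupBy.cumcount (0.13 or later), which checks the built-in against a hand-rolled numbering of the same frame. The variable names `order` and `manual` are only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [2, 3], [1, 4], [1, 5], [2, 6]])

# Built-in: number the rows within each group, starting from 0.
order = df.groupby(0).cumcount()

# Same numbering computed by hand, one group at a time.
manual = pd.Series(0, index=df.index)
for _, idx in df.groupby(0).groups.items():
    manual.loc[idx] = np.arange(len(idx))

print(order.tolist())
print(manual.tolist())
```

Both should give the same values as the 'order' column above.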

@jreback
Contributor

jreback commented Aug 23, 2013

counter() ?

@hayd
Contributor Author

hayd commented Aug 23, 2013

I kind of favour enumerate() as it's more in line with python (and it reminds me of Demolition Man).

I was wrong: count isn't taken for groupby, it's size, but I'm -1 on same/similarly named methods with different meaning across groupby/DataFrame.

@cpcloud
Member

cpcloud commented Aug 23, 2013

plus with enumerate maybe you could provide an iterator or column to cycle through..

@hayd
Contributor Author

hayd commented Aug 23, 2013

@cpcloud Interesting idea, not sure how it'd work (if you pass in an iterator it'd be eaten by the first thing/start part way through?). But perhaps mapping after the enumeration would be the way to go (mod and apply a getitem)?

In [21]: df['order'].apply(list('abc').__getitem__)  # continued from above
Out[21]:
0    a
1    a
2    b
3    c
4    b
Name: order, dtype: object

In [22]: (df['order'] % 2).apply(list('ab').__getitem__)  # I guess this is what you mean by cycle
Out[22]:
0    a
1    a
2    b
3    a
4    b
Name: order, dtype: object

To implement is there a more efficient way? Or should it just be a shortcut:

g.apply(lambda x: pd.Series(np.arange(len(x)), x.index))
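The "mod and getitem" idea can also be done in one vectorised step. This is only a sketch: it assumes GroupBy.cumcount (the name this thread later adopts) is available, and the `labels` array is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [2, 3], [1, 4], [1, 5], [2, 6]])
order = df.groupby(0).cumcount()  # position of each row within its group

labels = np.array(list("ab"))  # hypothetical labels to cycle through

# Wrap the position with modulo, then index into the label array in one go.
cycled = pd.Series(labels[(order % len(labels)).to_numpy()], index=df.index)
print(cycled.tolist())
```

Indexing a NumPy array with the wrapped positions avoids calling `__getitem__` once per row.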

@hayd
Contributor Author

hayd commented Oct 4, 2013

I wonder if enumerate could also mean enumerate by the group it's in (or does that not make sense?) maybe that could be a possible kwarg

@jtratner
Contributor

jtratner commented Oct 4, 2013

enumerate is something specific in Python (iterator with index), might want
to pick a different name.

As I read through I had this question: how is this different from
value_counts()?

@cpcloud
Member

cpcloud commented Oct 4, 2013

it's enumerating the integer index of the elements in a group

@cpcloud
Member

cpcloud commented Oct 4, 2013

sort of like a cumvalue_counts()

@cpcloud
Member

cpcloud commented Oct 4, 2013

(which is a horrible name for this)

@hayd
Contributor Author

hayd commented Oct 4, 2013

I think enumerate is the right name for this because it's something specific to python... I really like this name.

Was wondering if another potentially useful/expected thing for enumerate to do is enumerate the groups (I'm not sure), a little bit like this:

from itertools import count
c = count(-1)
df = pd.DataFrame([['a', 2], ['b', 3], ['a', 4], ['a', 5], ['b', 6]])

In [4]: df
Out[4]: 
   0  1
0  a  2
1  b  3
2  a  4
3  a  5
4  b  6

In [5]: g = df.groupby(0)

In [6]: g.apply(lambda x: pd.Series(next(c), x.index))  # next(c) also works on Python 3, unlike c.next()
Out[23]: 
0    0
1    1
2    0
3    0
4    1
dtype: int64
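For comparison, later pandas releases expose both flavours directly. A hedged sketch, assuming a pandas recent enough to have GroupBy.ngroup (added well after this thread, in 0.20.2 as far as I know) alongside GroupBy.cumcount:

```python
import pandas as pd

df = pd.DataFrame([['a', 2], ['b', 3], ['a', 4], ['a', 5], ['b', 6]])
g = df.groupby(0)

group_number = g.ngroup()    # which group each row belongs to (keys sorted)
within_group = g.cumcount()  # position of each row inside its own group

print(group_number.tolist())
print(within_group.tolist())
```

This avoids the shared itertools.count state, which depends on how many times apply calls the function.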

@hayd
Contributor Author

hayd commented Oct 4, 2013

Although potentially the groups are not well ordered anyway....

@hayd
Contributor Author

hayd commented Oct 4, 2013

(I think they are ordered, so this should be ok, I wonder if these two alternatives could be distinguished with a well chosen kwarg...)

@ghost ghost assigned hayd Oct 22, 2013
@ifmihai

ifmihai commented Nov 12, 2013

My point of view,
as an average python/pandas user,

a name with the verb 'to count' is the most intuitive
because you count the apparitions
I take the first value in a Series, let's say value is 6,
and then I parse visually the Series counting the apparitions of 6

enumerate in python is not counting
it gets every element with its index
so it's not intuitive at all from my point of view
it's quite confusing actually

@hayd
Contributor Author

hayd commented Nov 12, 2013

I take the point that it is somewhat confusing, but think that any other option will be significantly more so.

This is enumerating each of the "apparitions". count is the wrong word since it usually has the meaning (elsewhere in pandas, e.g. DataFrame .count) of counting the total occurrences (i.e. groupby .size). We are not counting, we are enumerating.

enumerate in python is not counting it gets every element with its index

If you replace index with occurrence, this is exactly what we are doing.
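To make the count-versus-enumerate distinction concrete, here is a small sketch. It assumes the method discussed here exists under the name the thread eventually picks, cumcount; `totals` and `running` are illustrative names:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [2, 3], [1, 4], [1, 5], [2, 6]])
g = df.groupby(0)

totals = g.size()       # "counting": one total per group
running = g.cumcount()  # "enumerating": one number per row, restarting per group

print(totals.to_dict())
print(running.tolist())
```

The first result has one entry per group, the second has one entry per original row.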

@jreback
Contributor

jreback commented Nov 12, 2013

how about tally, or number ?

@cpcloud
Member

cpcloud commented Nov 12, 2013

I like tally, -1 on number.

@ifmihai

ifmihai commented Nov 12, 2013

other names:
counter
Numerate
Apparencies
Count_apparencies
Count_occurences
Occurences
Rolling_count

That's all I can think of for now

@ifmihai

ifmihai commented Nov 12, 2013

I checked English dictionary
Apparitions and apparencies do not make any sense in English, it seems

sorry, English is not my first language

@jtratner
Contributor

maybe "appearances" is close to what you were thinking?

@hayd
Contributor Author

hayd commented Nov 12, 2013

-1 on count or counter (as mentioned above). Also -1 on rolling_count since those functions use windows rather than being cumulative, and actually it does make sense to apply these.

I wonder if cumcount makes sense, in line with the already existing cumsum etc. :s (though I dislike it mathematically)

I'm unsure about tally, it could be ok but I'm unsure if it's a bit of colloquialism (I think it's less clear than enumerate)...

@TomAugspurger
Contributor

I like tally. cumcount is good too.

EDIT: Actually, after thinking a bit more about cumcount I like it less. cumsum is always increasing. cumcount wouldn't be since the "current" cumulative count is jumping between groups...

@cpcloud
Member

cpcloud commented Nov 12, 2013

@ifmihai I like apparitions 👻 😄

@hayd
Contributor Author

hayd commented Nov 13, 2013

@TomAugspurger It's always increasing within each group, just like cumcount/enumerate.

cumsum is already available:

In [31]: df = pd.DataFrame([[1, 1], [2, 1], [1, 2]], columns=['A', 'B'])

In [32]: g = df.groupby('B')

In [33]: g.cumsum()
Out[33]: 
   A  B
0  1  1
1  3  2
2  1  2

In [34]: g.A.cumsum()
Out[34]: 
0    1
1    3
2    1
dtype: int64

(and is not increasing)
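A quick way to check the "increasing within each group, but not overall" point. This is a sketch, assuming a pandas where Series.is_monotonic_increasing is available; `within` is an illustrative name:

```python
import pandas as pd

df = pd.DataFrame([[1, 1], [2, 1], [1, 2]], columns=['A', 'B'])
g = df.groupby('B')

cs = g['A'].cumsum()  # running sum of A within each B group

# Overall the result jumps back down when the group changes...
print(cs.is_monotonic_increasing)

# ...but inside every group it never decreases (A is positive here).
within = cs.groupby(df['B']).apply(lambda x: x.is_monotonic_increasing)
print(within.all())
```

The same check applies to a per-group running count, which restarts at each new group.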

@hayd
Contributor Author

hayd commented Nov 13, 2013

Actually I think tally suffers from one of the issues that count etc. do: it usually means total (and not cumulative total).

If we were to go with cumcount etc we should make cumsum etc. a bit more visible...

@TomAugspurger
Contributor

Yep, I was wrong. Not sure what I was thinking. Should we split the difference and go with cumtally? Programming is hard.

@ifmihai

ifmihai commented Nov 13, 2013

@jtratner YES! appearances was the word I was searching for,
but it seems cumcount, enumerate, tally are preferred
(I put them in the order of my own preference)

From my perspective (of a foreigner) tally doesn't mean anything, especially in programming
cumcount appears the most logical up to this point, to be related to pandas also (or statistics)
but count also gives me the impression of a total, not cumulative

@cpcloud ha ha! anyway, apparitions are interesting :)

some other words that can be used (maybe):
track
reckon(?)
register(?)
mark
score
account

personally I prefer appearances or track_appearances

I guess this function won't be used that often, so the name can be longer if needed, right?

@jorisvandenbossche
Member

-1 for tally/cumtally. As a non-native English speaking person, I had never heard of that word, and I asked my colleagues and I am not alone.

If I were to describe the action, I would say I number the items within the group (give them a number), so maybe number or numbering (but number also has other nuances, so maybe it's not so clear).
For the rest, I think cumcount is clear and in line with cumsum etc, so +1. And I think I like 'occurrence' more than 'appearance'.

@hayd
Contributor Author

hayd commented Nov 13, 2013

Let's go with cumcount then.

We should try and make these (cumsum etc) tab complete, for consistency...

@jreback
Contributor

jreback commented Nov 13, 2013

+1 for cumcount

@cpcloud
Member

cpcloud commented Nov 13, 2013

+1 here too

@hayd
Contributor Author

hayd commented Nov 14, 2013

cumcount it is, now in master/0.13.

@hayd
Contributor Author

hayd commented Nov 20, 2013

One annoying thing I've realised concerns as_index: if you call this on a groupby with as_index=True, should the group key be included in the result's index? Note filter doesn't (should it??), most other things do... or try to at least:

Observe:

In [9]: df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])

In [10]: g = df.groupby('A')  # effectively as_index=True

In [11]: g.head(1)  # MI, of groupby + df.index
Out[11]: 
     A  B
A        
1 0  1  2
5 2  5  6

In [12]: g.cumcount()  # index of df
Out[12]: 
0    0
1    1
2    0
dtype: int64

thoughts?

@ifmihai

ifmihai commented Nov 20, 2013

So what's the question? :)
I'm not sure I follow, although I want to answer.

ps.
cumcount() will be available for pandas.Series also, right?

@hayd
Contributor Author

hayd commented Nov 20, 2013

@ifmihai I think I'll turn this into a more general question about as_index consistency. It's a little strange as nth does a different thing too.

The difference shows when you look at the index of the above results: the head result has A prepended to the index...

cumcount is not a Series method, what were you thinking it would do? sugar for s.groupby(s).cumcount() ?
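Spelled out, that sugar could look like the sketch below. `series_cumcount` is a hypothetical helper name, not a pandas method, and it assumes GroupBy.cumcount is available:

```python
import pandas as pd


def series_cumcount(s):
    """Hypothetical helper: number each value's occurrences so far, from 0."""
    return s.groupby(s).cumcount()


s = pd.Series([6, 3, 6, 6, 3])
occurrence = series_cumcount(s)
print(occurrence.tolist())
```

Grouping the Series by its own values means each distinct value forms one group, so the result tracks how many times that value has already appeared.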

@jreback
Contributor

jreback commented Nov 20, 2013

@hayd when you have a chance, can you add a versionadded tag to the docs for this?

@hayd
Contributor Author

hayd commented Nov 20, 2013

I was wondering if index should be:

In [27]: g.cumcount()
Out[27]: 
A   
1  0    0
   1    1
5  2    0
dtype: int64
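If someone wants that shape today, it can be built by hand. This is a sketch of the alternative being floated, not something cumcount does itself; `with_key` is an illustrative name, and the rows here happen to be in group order already:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
cc = df.groupby('A').cumcount()  # indexed like df

# Hypothetical alternative: prepend the group key as an outer index level.
with_key = pd.Series(
    cc.to_numpy(),
    index=pd.MultiIndex.from_arrays([df['A'], df.index], names=['A', None]),
)
print(with_key)
```

Keeping the original index, as cumcount does, makes `df['order'] = g.cumcount()` align without any reshaping.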

This came up as I was trying to tweak nth, but got into a muddle with what that does to get its index.

@ifmihai

ifmihai commented Nov 21, 2013

@hayd I use a separate function now, like cumcount(), to count a Series, or a column in a df. I wasn't even thinking about df.groupby() up to this thread. Right now I don't see much use for it through groupby(), as I don't have use cases in mind.

Now back to the original question, with as_index,
as a vote,
the index seems more natural to include A,
but I'm not sure 100%

ps. I cannot work with 0.13 right now (I don't know how to play with separate environments)

7 participants