-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Add groupby().ngroup() method to count groups (#11642) #14026
Conversation
Number each group from 0 to the number of groups - 1. | ||
|
||
This is the enumerative complement of cumcount. Note that the | ||
numbers given to the groups match the order in which the groups |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
versionadded tag
I think you can add this function here, so that |
doc/source/groupby.rst
Outdated
@@ -969,7 +969,7 @@ Enumerate group items | |||
.. versionadded:: 0.13.0 | |||
|
|||
To see the order in which each row appears within its group, use the | |||
``cumcount`` method: | |||
``cumcount`` method (compare with ``enumerate``): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what this means
Isn't the name |
can you rebase / update? |
looking back on this, shouldn't we call this |
@jreback: I'll buy that. The collision with |
I agree The verb is also 'number'. But that is maybe also a bit strange to use given the use of
So this points back to enumerate .. |
how about |
pandas/core/groupby.py
Outdated
4 0 | ||
5 1 | ||
dtype: int64 | ||
>>> df = pd.DataFrame([['b'], ['a'], ['a'], ['b']], columns=['A']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a canonical way to create the df? I'd generally have used pd.DataFrame({'A': list('baab')})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or without the list
expansion is fine too
any more thoughts on the name? does SQL call this anything? summary (and some new)
|
any further thoughts on this? |
@dsm054 thoughts on this? |
Looking back I still think that
In SQL I'd use something like DENSE_RANK() to get this. |
I like |
a6e60a7
to
7aee071
Compare
7aee071
to
966f9be
Compare
Codecov Report
@@ Coverage Diff @@
## master #14026 +/- ##
=========================================
Coverage ? 90.76%
=========================================
Files ? 161
Lines ? 51098
Branches ? 0
=========================================
Hits ? 46377
Misses ? 4721
Partials ? 0
Continue to review full report at Codecov.
|
I agree that |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! (nice suite of tests!)
doc/source/groupby.rst
Outdated
|
||
df.groupby('A').ngroup() | ||
|
||
df.groupby('A').ngroup(ascending=False) # kwarg only |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can remove the "kwarg only" I think. It's not really clear to what it points, and it is also not correct I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, this was a copy/paste from the cumcount doc which I didn't remove. I don't think it holds for that either, so I'll remove them both.
doc/source/groupby.rst
Outdated
way similar to ``pd.factorize()``, but which applies naturally to multiple | ||
columns of mixed type and different sources: | ||
|
||
.. ipython::python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
space after the ::
needed
pandas/tests/groupby/test_groupby.py
Outdated
assert_series_equal(g_ngroup, expected_ngroup) | ||
assert_series_equal(g_cumcount, expected_cumcount) | ||
|
||
def test_ngroup_cumcount_pair(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you testing here something specific? Or just to have more test cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you mean test_ngroup_matches_cumcount
, yeah, it's just a specific two-column case to make sure they align. If you mean test_ngroup_cumcount_pair
, that's to make sure that the (ngroup, cumcount) pair you can assign to each row matches expectation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant the test_ngroup_cumcount_pair
. It is just not really clear to me what the test specifically tries to test in addition to the other tests
Just to be sure we have considered this: we have the question whether this should behave similar to In the second case we could always have both behaviours as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. some doc comments. ping when ready.
doc/source/groupby.rst
Outdated
|
||
.. ipython:: python | ||
|
||
df = pd.DataFrame(list('aaabba'), columns=['A']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name this df something else as it might be clobbering the other parts of the docs (.e.g dfg or something)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternatively, maybe better is to use the same example as cumcount (change if needed to make it easier to talk about), and then you can contrast it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's already the same as the cumcount example, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh ok, just reading as a user want to know: what is the difference between these and when should I use one over other (which you explain in the doc-string, just add somewhere here)
doc/source/groupby.rst
Outdated
.. versionadded:: 0.20.0 | ||
|
||
To see the ordering of the groups themselves, you can use the ``ngroup`` | ||
method: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a comment differentiating this from .cumcount
(when to use which)
doc/source/groupby.rst
Outdated
.. ipython::python | ||
|
||
df = pd.DataFrame({"A": [1, 1, 2, 3, 2], "B": list("aaaba")}) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you elaborate on why this is useful
doc/source/whatsnew/v0.20.0.txt
Outdated
Groupby Group Numbers | ||
^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
A new groupby method ``ngroup``, parallel to the existing ``cumcount``, has been added to return the group order (:issue:`11642`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.ngroup()
and .cumcount()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or even better: link to their docstring page, using :meth:`~pandas.core.groupby.GroupBy.ngroup`
(will only display the ngroup)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the grammar actually works better without the parentheses -- type(df.groupby("a").ngroup)
is what gives "method", after all. Should I do this to every mention of ngroup
in this doc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dsm054 usually do this in the first / most prominent mention (so here is obviously good). you certainly can do it more. I would also add the ref for cumcount here.
doc/source/whatsnew/v0.20.0.txt
Outdated
^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
A new groupby method ``ngroup``, parallel to the existing ``cumcount``, has been added to return the group order (:issue:`11642`). | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a ref back to the docs
This is the enumerative complement of cumcount. Note that the | ||
numbers given to the groups match the order in which the groups | ||
would be seen when iterating over the groupby object, not the | ||
order they are first observed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this comment is good, add something like this to the docs (in groupy.rst) where you show an example
pandas/core/groupby.py
Outdated
4 0 | ||
5 1 | ||
dtype: int64 | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add an example with multple groupers
pandas/tests/groupby/test_groupby.py
Outdated
@@ -4304,6 +4251,192 @@ def test_cummin_cummax(self): | |||
tm.assert_series_equal(expected, result) | |||
|
|||
|
|||
class TestCounting(tm.TestCase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perfect you moved them! even better, can you move to a separate file
pandas/tests/groupby/test_counting.py
(cumcount & ngroup). If you find other methods that are very relevant ok (note I wouldn't move .count()).
b9b484e
to
053935b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor doc comments. lgtm otherwise. ping on green.
doc/source/groupby.rst
Outdated
~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
||
By using ``.ngroup()``, we can extract information about the groups in a | ||
way similar to ``pd.factorize()``, but which applies naturally to multiple |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make this :func:`factorize`
and point a ref link to the factorize section of docs (in reshaping IIRC0
doc/source/whatsnew/v0.20.2.txt
Outdated
@@ -21,6 +21,7 @@ Enhancements | |||
|
|||
- Unblocked access to additional compression types supported in pytables: 'blosc:blosclz, 'blosc:lz4', 'blosc:lz4hc', 'blosc:snappy', 'blosc:zlib', 'blosc:zstd' (:issue:`14478`) | |||
- ``Series`` provides a ``to_latex`` method (:issue:`16180`) | |||
- A new groupby method :meth:`~pandas.core.groupby.GroupBy.ngroup`, parallel to the existing :meth:`~pandas.core.groupby.GroupBy.cumcount`, has been added to return the group order (:issue:`11642`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a ref to the docs you just created here
053935b
to
5bb1551
Compare
5bb1551
to
7d3dd0c
Compare
7d3dd0c
to
73f0d6a
Compare
@dsm054 if you can update docs as indicated and rebase today would be great. |
73f0d6a
to
8383c54
Compare
8383c54
to
14d941b
Compare
@jreback: had to rebase this morning 'cause of that linting error you fixed. I'll keep doing so until we're green. |
test_write_fspath_all[to_hdf-writer_kwargs3-tables] failed, which doesn't seem related. |
yeah, I didn't realize that test was flaky, but I am able to reproduce it occasionally. |
thanks @dsm054 nice PR! |
…as-dev#14026) (cherry picked from commit 72e0d1f)
Version 0.20.2 * tag 'v0.20.2': (68 commits) RLS: v0.20.2 DOC: Update release.rst DOC: Whatsnew fixups (pandas-dev#16596) ERRR: Raise error in usecols when column doesn't exist but length matches (pandas-dev#16460) BUG: convert numpy strings in index names in HDF pandas-dev#13492 (pandas-dev#16444) PERF: vectorize _interp_limit (pandas-dev#16592) DOC: whatsnew 0.20.2 edits (pandas-dev#16587) API: Make is_strictly_monotonic_* private (pandas-dev#16576) BUG: reimplement MultiIndex.remove_unused_levels (pandas-dev#16565) Strictly monotonic (pandas-dev#16555) ENH: add .ngroup() method to groupby objects (pandas-dev#14026) (pandas-dev#14026) fix linting BUG: Incorrect handling of rolling.cov with offset window (pandas-dev#16244) BUG: select_as_multiple doesn't respect start/stop kwargs GH16209 (pandas-dev#16317) return empty MultiIndex for symmetrical difference on equal MultiIndexes (pandas-dev#16486) BUG: Bug in .resample() and .groupby() when aggregating on integers (pandas-dev#16549) BUG: Fixed tput output on windows (pandas-dev#16496) Strictly monotonic (pandas-dev#16555) BUG: fixed wrong order of ordered labels in pd.cut() BUG: Fixed to_html ignoring index_names parameter ...
This basically adds a method to give access to the coding returned by
grouper.group_info[0]
, i.e. the number of the group that each row is in. This is a natural parallel to cumcount(), and while it's not the world's most important feature it's come in handy for me from time to time and deserves a public method, IMHO.git diff upstream/master | flake8 --diff