ENH: add return_inverse to duplicated for DataFrame/Series/Index/MultiIndex #21645
Conversation
Can you add a whatsnew entry to 0.24.0? Also, looks like we already have ASVs for this: pandas/asv_bench/benchmarks/frame_methods.py, lines 412 to 429 in 838e8da.
pandas/core/frame.py (Outdated)

        compatible with ``return_inverse``.
    return_inverse : boolean, default False
        Determines whether the mapping from unique elements to the original
        index should be returned. If true, the output is a tuple.
Can you add a ``.. versionadded:: 0.24.0`` to this?
    import pandas.util.testing as tm

    class TestDataFrameDuplicated(object):
I think these tests should be in tests/frame/test_analytics.py. It looks like there are a couple of DataFrame.duplicated tests related to bugs there, though nothing related to basic behavior, e.g. pandas/tests/frame/test_analytics.py, lines 1634 to 1649 in 4c4579d:
    @pytest.mark.slow
    def test_duplicated_do_not_fail_on_wide_dataframes(self):
        # gh-21524
        # Given the wide dataframe with a lot of columns
        # with different (important!) values
        data = {'col_{0:02d}'.format(i): np.random.randint(0, 1000, 30000)
                for i in range(100)}
        df = pd.DataFrame(data).T
        result = df.duplicated()
        # Then duplicates produce the bool pd.Series as a result
        # and don't fail during calculation.
        # Actual values doesn't matter here, though usually
        # it's all False in this case
        assert isinstance(result, pd.Series)
        assert result.dtype == np.bool
The Series.duplicated tests are also in the associated tests/series/test_analytics.py file, so it makes sense to mirror that setup.
    # keep = 'first'
    exp = Series([False, False, True, False, True])
    assert_series_equal(df.duplicated(keep='first'), exp)
My preference would be to use a slightly more explicit pattern throughout these tests, along the lines of:

    expected = ...
    result = ...
    assert_series_equal(expected, result)
Codecov Report

    @@            Coverage Diff             @@
    ##           master   #21645      +/-   ##
    ==========================================
    - Coverage   92.18%   92.04%    -0.15%
    ==========================================
      Files         169      169
      Lines       50820    50818        -2
    ==========================================
    - Hits        46850    46776       -74
    - Misses       3970     4042       +72

Continue to review full report at Codecov.
Thanks for the PR
    class TestDataFrameDuplicated(object):

        def test_duplicated_keep(self):
Can you parametrize this test?
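A parametrized version could look like the sketch below; it reuses the data and expected values quoted from the PR's test, with the `keep='last'` and `keep=False` expectations worked out from the same frame (the exact test name and layout in the PR may differ):

```python
import pytest
from pandas import DataFrame, Series
import pandas.testing as tm


@pytest.mark.parametrize('keep, expected', [
    ('first', [False, False, True, False, True]),
    ('last', [True, True, False, False, False]),
    (False, [True, True, True, False, True]),
])
def test_duplicated_keep(keep, expected):
    # rows 0/4 and 1/2 are duplicates of each other
    df = DataFrame({'A': [0, 1, 1, 2, 0], 'B': ['a', 'b', 'b', 'c', 'a']})
    result = df.duplicated(keep=keep)
    tm.assert_series_equal(result, Series(expected))
```

This collapses the three manual `# keep = ...` blocks into one test and also follows the `result`/`expected` naming convention requested elsewhere in the review.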
    df = DataFrame({'A': [0, 1, 1, 2, 0], 'B': ['a', 'b', 'b', 'c', 'a']})

    # keep = 'first'
    exp = Series([False, False, True, False, True])
As a pseudo-standard, it would be preferable to name this `expected` instead of `exp`.
    # keep = 'first'
    exp = Series([False, False, True, False, True])
    assert_series_equal(df.duplicated(keep='first'), exp)
Along the same lines: do `result = df.duplicated(...)` on a previous line, then simply `assert_series_equal(result, expected)`.
    assert_series_equal(df.duplicated(keep=False), exp)

    def test_duplicated_nan_none(self):
        # np.nan and None are considered equal
Hmm is this expected?
This is how it works currently - stuff like this becomes visible when writing tests. ;-)
Of course someone can open a separate issue on this.
Opened #21720
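For reference, the behavior in question is easy to reproduce with a minimal snippet (this uses plain pandas, not the PR's new kwarg; in a float column `None` is coerced to `NaN`, so the two are indistinguishable to `duplicated`):

```python
import numpy as np
import pandas as pd

# None is stored as NaN in a float64 Series, so duplicated()
# treats np.nan and None as equal values
s = pd.Series([np.nan, None, 1.0])
print(s.duplicated().tolist())  # [False, True, False]
```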
    assert_frame_equal(reconstr, df)

    # keep = False
    rgx = 'The parameters return_inverse=True and keep=False cannot be.*'
Make this a separate test called `test_duplicated_inverse_raises`.
    Returns
    -------
    duplicated : Series
    duplicated : Series or tuple of Series if return_inverse is True
What's the design consideration in going `(Series, Series)` instead of, say, `(Series, ndarray)`? Wondering if the latter wouldn't be preferable, since AFAICT it would mostly be used for indexing operations.
It's absolutely essential that it's a Series (otherwise it would have to be at least two arrays), because - unlike numpy - we also need to keep track of the original index. So the second Series links two arrays: which original index gets mapped to which index in the deduplicated one. This is how the reconstruction can succeed:

    df == unique.reindex(inv.values).set_index(inv.index)
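Since `return_inverse` only exists in this PR, the round trip can be sketched with existing pandas operations; `inv` below emulates the inverse Series described above (the emulation via `factorize` over row tuples is illustrative, not the PR's implementation):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2], 'b': ['x', 'x', 'y']}, index=[10, 20, 30])

isdup = df.duplicated(keep='first')
unique = df.loc[~isdup]                      # same as df.drop_duplicates()

# emulate the proposed inverse: map every original label to the label
# of the first occurrence of its row
codes, _ = pd.factorize(df.apply(tuple, axis=1))
inv = pd.Series(unique.index[codes], index=df.index)

# reconstruction, as described in the comment above
reconstructed = unique.reindex(inv.values).set_index(inv.index)
assert reconstructed.equals(df)
```

The point of the second Series is visible here: `inv.values` indexes into the deduplicated object, while `inv.index` restores the original row labels.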
I added examples to the docstrings that illustrate the usage (as well as the need to keep track of the index of the deduplicated object in case of Series/DataFrame).
The index is still available in the first output, so it is not that essential that the second output is a Series (there is still a design issue for sure)
pandas/core/frame.py (Outdated)

    ids = ids[::-1]  # np.unique takes first occurrence as unique value
    _, o2u, u2o = np.unique(ids, return_inverse=True,
                            return_index=True)
    inv = Series(self.index[::-1][o2u][u2o][::-1], index=self.index)
Can you express this in some other fashion, or at the very least comment it? The double-reversal step logic is tough to understand, at least at first glance.
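For what it's worth, the double-reversal trick can be demonstrated in isolation: `np.unique` keeps the first occurrence of each value, so running it on the reversed array and then reversing the resulting mapping yields last-occurrence positions (variable names here are illustrative, not from the PR):

```python
import numpy as np

ids = np.array([3, 1, 3, 2, 1])

# np.unique keeps the FIRST occurrence; to get, for each element, the
# position of the LAST occurrence of its value, run it on the reversed
# array and translate the positions back
rev = ids[::-1]
_, first_idx, inverse = np.unique(rev, return_index=True, return_inverse=True)
pos_in_rev = first_idx[inverse]               # positions within the reversed array
last_pos = (len(ids) - 1) - pos_in_rev[::-1]  # back to original positions

print(last_pos.tolist())  # [2, 4, 2, 3, 4]
```

Each element maps to the position of the last occurrence of its value (3 → position 2, 1 → position 4, 2 → position 3), which is what a `keep='last'` inverse needs.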
@jschendel @WillAyd Another question that I realised while writing the whatsnew entry - should this kwarg be made available to Series/Index as well? In this case I'm guessing I'd have to adapt
@jschendel Re:

Haven't ever worked with the ASVs yet - will have a look later.
I added three runs in the ASV; here are the results against current master. I don't know why

Edit: ran again on a clean, restarted machine with no background applications.
Force-push: 9083a7b to d3b5d0a

Hello @h-vetinari! Thanks for updating the PR. Cheers! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on August 31, 2018 at 08:00 UTC
Force-push: 293eace to 094139b
        # will take fastpath for no duplicates
        self.df2.duplicated(return_inverse=True)

    def time_frame_duplicated_mixed(self):
Instead of the 3 separate benchmarks below, you can parameterize over the `return_inverse` and `keep` keyword arguments. Here's an example: https://github.com/h-vetinari/pandas/blob/094139bffa07c101882bbfbf35a1621eadc00adc/asv_bench/benchmarks/frame_methods.py#L219-L231
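A parameterized ASV benchmark along these lines could look like the sketch below. It parametrizes only over `keep`, since `return_inverse` exists only in this PR; the frame size and column layout are illustrative:

```python
import numpy as np
import pandas as pd


class Duplicated:
    # asv runs each time_* method once per entry in params,
    # replacing the separate per-keyword benchmarks
    params = ['first', 'last', False]
    param_names = ['keep']

    def setup(self, keep):
        n = 10_000
        self.df = pd.DataFrame({'a': np.random.randint(0, 100, n),
                                'b': np.random.randint(0, 100, n)})

    def time_frame_duplicated(self, keep):
        self.df.duplicated(keep=keep)
```

With this layout, `asv` reports one timing per parameter value under a single benchmark name instead of three separate entries.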
Thanks for the hint, coming in the commit for the next feedback round.
Is it normal that the times are then not split up depending on the parameter? I get:

    [...]
    [ 16.67%] ··· Running frame_methods.Duplicated.time_frame_duplicated       469ms;...
    [ 33.33%] ··· Running frame_methods.Duplicated.time_frame_duplicated_mixed 125ms;...
    [ 50.00%] ··· Running frame_methods.Duplicated.time_frame_duplicated_wide  203ms;...
    [...]
I believe the semicolons are splitting the time depending on the parameter. The format usually looks better with `asv continuous`.
@mroeschke The output I posted is an extract from `asv continuous`, but I hadn't looked into `asv compare` yet, where the results were hiding. :)
Test failure in circleci is unrelated. Failed in
@mroeschke To me, the benchmarks seem a bit unstable; otherwise I don't know where the performance differences come from.
still have yet to review the actual code
doc/source/whatsnew/v0.23.1.txt (Outdated)

    @@ -1,7 +1,7 @@
    .. _whatsnew_0231:

    v0.23.1
    -------
    v0.23.1 (June 12, 2018)
don't modify things, separate PR if you would like to change this
doc/source/whatsnew/v0.24.0.txt (Outdated)

    ``DataFrame.duplicated`` has gained the ``return_inverse`` kwarg
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

    Previously, there was no way to determine how duplicate rows in a ``DataFrame`` got mapped to the deduplicated, unique subset. This made it hard to push back
Too much explanation. Make this a sentence or two (not all of the text); you don't need to explain why.
doc/source/whatsnew/v0.24.0.txt (Outdated)

    index=[1, 4, 9, 16, 25])
    df
    isdup, inv = df.duplicated(return_inverse=True)  # default: keep='first'
    isdup
spell out names
doc/source/whatsnew/v0.24.0.txt (Outdated)

    .. ipython:: python

        unique = df.loc[~isdup]  # same as df.drop_duplicates()
this example is confusing because you are using different code here
Not sure I understand. It refers to the same objects as in the code block above - it's just a continuation.
As I already mentioned above, it became obvious to me while writing the whatsnew entry that this kwarg should be enabled for Series/Index as well. The last commit contains a working version but no tests yet -- ran into troubles with #21684. Additional xref #21685.

This is already much more elegant than the previous commits, because it does all the work for

In summary, the following is still lacking:
Force-push: 819bc9a to 80f5881
I finally finished adding all the tests; turns out it's quite complicated to do anything for

The core algorithm changes almost vanish compared to the amount of test changes. Some of the things I added tests for were not really tested so far (bare

ASVs should be in line with what I posted above, because as long as

I needed to implement a different signature for

Tried to explain all the code as well as possible, but happy to receive feedback (and/or answer questions).
Force-push: 80f5881 to 8acfe88
pandas/core/algorithms.py (Outdated)

    - False : Mark all duplicates as ``True``.
    - False : Mark all duplicates as ``True``. This option is not
      compatible with ``return_index`` or ``return_inverse``.
    return_index : boolean, default False
pls pls let's keep the scope to the original. just return_inverse. If you want to do more things, make them PR's on top.
@jreback I did keep strictly with the scope of just adding

There are two options from my POV:

Does that make the situation clearer? I don't mind implementing option 1, but it would lead to code duplication and would IMHO be messier.
Force-push: 8acfe88 to b7d6077
Ugh, resolving conflicts on windows is such a pain (conflict resolution commits not recognised by git). Had to do a massive rebase... Also added some examples to the separate
Force-push: b7d6077 to 93bd97f
Force-push: bd4430e to 18a0de7
Didn't look in detail at the implementation, but some general questions:

- What is the reason you decided to add it to `duplicated` and not `drop_duplicates`? (Because from quickly looking at the issue, and from the docstring examples, that is actually what you want?) I also ask because I think it would be a more natural fit there, as it is not literally an "inverse" for the result of `duplicated` (as it is for `drop_duplicates`, or as for `unique` in numpy).
- I think we should have more discussion (or at least motivation for the current design) about the actual return value (series vs array, index labels vs positional integers). Always returning an array of integers would also be an option (and is e.g. robust for duplicates in the index).
    @@ -159,6 +159,52 @@ This is the same behavior as ``Series.values`` for categorical data. See
    :ref:`whatsnew_0240.api_breaking.interval_values` for more.

    .. _whatsnew_0240.enhancements.duplicated_inverse:

    The `duplicated`-method has gained the `return_inverse` kwarg
Double backticks where you now use single backticks (also on the lines below) - in rst, double backticks give code-styled text.
pandas/core/frame.py (Outdated)

    - False : Mark all duplicates as ``True``. This option is not
      compatible with ``return_inverse``.
    return_inverse : boolean, default False
        If True, also return the selection from the index from the
"Selection from the index" sounds strange to me, and it does not really say what the values are (I understand what you mean, but I think we should think about a better wording). Also, the following lines are rather complex to understand (e.g. the "boolean complement of the first output").
    Returns
    -------
    duplicated : Series
    duplicated : Series or tuple of Series if return_inverse is True
The index is still available in the first output, so it is not that essential that the second output is a Series (there is still a design issue for sure)
I wanted to add

I still think it makes sense for

numpy returns two arrays of indexes, one relative to the original set and one relative to the unique values. I think this information fundamentally belongs together, and since pandas supports the index/values distinction, I strongly believe that they should live in the same object. I don't like reusing the index of
@jorisvandenbossche
@jreback @WillAyd @jorisvandenbossche How can I/we move this PR forward?
@h-vetinari sorry for the slow answer. I think my main worry is that we are adding a

Numpy's
Can you clarify this? You are talking about

Side note: it might make sense to add this to
Fair point. I would prefer a

which is how I implemented it here. Could still be added to

Re:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html has both
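For context, the two mappings that `np.unique` can return are easy to demonstrate (plain numpy, independent of this PR):

```python
import numpy as np

a = np.array([2, 1, 2, 3])
uniq, index, inverse = np.unique(a, return_index=True, return_inverse=True)

# 'index' recovers the uniques from the original: uniq == a[index]
# 'inverse' reconstructs the original from the uniques: a == uniq[inverse]
print(uniq.tolist())     # [1, 2, 3]
print(index.tolist())    # [1, 0, 3]
print(inverse.tolist())  # [1, 0, 1, 2]
```

Note that `np.unique` sorts its output, which is one reason the discussion above prefers `pd.unique`/`factorize` (order of first appearance) for the pandas-level API.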
Agreed. Actually, for me the best solution would be:
i haven't looked at this in any detail, there is a lot going on here.
pandas/core/algorithms.py (Outdated)

    if keep == 'first':
        # o2u: original indices to indices of ARRAY of unique values
        # u2o: reduplication from array of unique values to original array
I'd rather you not use `np.unique` at all: it's not as performant as `pd.unique`, doesn't handle all dtypes, and sorts. Further, pls pls use actual names here, and really avoid using abbreviations in any library code.
> I'd rather you not use `np.unique` at all, its not as performant as `pd.unique`, doesn't handle all dtypes, and sorts.

Yes, it would be nicer to have this implemented in the cython hashtable functions, but that performance improvement is for a follow-up. `np.unique` is an easy solution and is invoked only for `return_inverse=True` (and we're only calling it on a series of ints, not objects, because the factorization for that hasn't changed!).

> Further pls, pls use actual names here, and really avoid using abbreviations in any library code.

There's a fully commented, thoroughly explained and very localized part where these appear. Not sure how this is unclear, but will adapt...
Finally, the core changes are minimal (but I understand that it looks like a lot). TL;DR: implementation moves to `core.algorithms`, everything else wraps around that (+ doc improvements). Tests are expanded, and ASVs added.
I have wanted a

Anyway, yes @jreback put you on the path to add it to

I think the numpy's

Adding a
i am not against adding this to .unique, though as you have mentioned that is indeed just drop_duplicates
@jorisvandenbossche

I would suggest that

Assuming

and neither

I had already collected some thoughts for this during the day and opened a (largely orthogonal to this PR) issue for this: #22824

EDIT: clarified mechanics for inverse
That is because for the inverse you rely on

I see how you use both
When I started this, I just saw

Looking at the implementation of factorize, you are right, there is a

I could work around the former (like I do here for

Maybe need to add a way for

Just to make sure we're on the same page: I need the
Closes #21357

I followed the recommendation of @jreback there to change `duplicated`, and to make the output a tuple if `return_inverse=True`.

I didn't find a test for the behavior of `DataFrame.duplicated()` -- there's a test of `Index.duplicated()` and `Series.duplicated()` in `tests/test_base.py`, but nothing I could find in `tests/frame/`.

So I added a file to test current behavior, as well as the new kwarg. Notably, the backend between the different Series in the output tuple is different (`pandas._libs.hashtable.duplicated_int64` vs. `np.unique`), and so I also added several tests that these results are exactly compatible.