ENH: add return_inverse to duplicated for DataFrame/Series/Index/MultiIndex #21645
@@ -159,6 +159,52 @@ This is the same behavior as ``Series.values`` for categorical data. See
 :ref:`whatsnew_0240.api_breaking.interval_values` for more.
 
+.. _whatsnew_0240.enhancements.duplicated_inverse:
+
+The ``duplicated`` method has gained the ``return_inverse`` keyword
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The ``duplicated`` method for ``Series``, ``DataFrame`` and all flavors of ``Index`` has gained a ``return_inverse`` keyword,
[Review comment] Double backticks are for literals and code samples, so let's put …
[Reply] OK. I have to admit I was always confused by the difference between single/double backticks in rst...
[Reply] I see I commented now again the reverse... In docstrings we do use single backticks to refer to parameter names.
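For reference, the rst distinction this thread is about (a generic note, not part of the PR itself): single backticks produce "interpreted text", whose rendering depends on the active role, while double backticks produce an inline literal rendered as code.

```
`interpreted text`    single backticks; meaning depends on the default role
``inline literal``    double backticks; always rendered as code-styled text
```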
+which is ``False`` by default. Specifying ``return_inverse=True`` will add an object to the output (which therefore becomes a tuple)
+that allows reconstructing the original object from the deduplicated, unique subset (:issue:`21357`).
+
+For ``Index`` objects, the inverse is an ``np.ndarray``:
+
+.. ipython:: python
+
+    idx = pd.Index(['a', 'b', 'b', 'c', 'a'])
+    isduplicate, inverse = idx.duplicated(return_inverse=True)  # default: keep='first'
+    isduplicate
+    inverse
||
This allows to reconstruct the original ``Index`` as follows: | ||
|
||
.. ipython:: python | ||
|
||
unique = idx[~isduplicate] # same as idx.drop_duplicates() | ||
unique | ||
|
||
reconstruct = unique[inverse] | ||
reconstruct.equals(idx) | ||
|
+For ``DataFrame`` and ``Series`` the inverse needs to take the original index into account as well, and is therefore a ``Series``,
+which contains the mapping from the index of the deduplicated, unique subset back to the original index.
+
+.. ipython:: python
+
+    df = pd.DataFrame({'A': [0, 1, 1, 2, 0], 'B': ['a', 'b', 'b', 'c', 'a']},
+                      index=[1, 4, 9, 16, 25])
+    df
+    isduplicate, inverse = df.duplicated(keep='last', return_inverse=True)
+    isduplicate
+    inverse
+
+    unique = df.loc[~isduplicate]  # same as df.drop_duplicates(keep='last')
+    unique
+    reconstruct = unique.reindex(inverse.values).set_index(inverse.index)
+    reconstruct.equals(df)
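Since ``return_inverse`` is only proposed in this PR (it is not part of released pandas), the same round trip for the ``Index`` example above can be sketched today with plain ``np.unique``; variable names like ``first_pos`` are illustrative only:

```python
import numpy as np
import pandas as pd

idx = pd.Index(['a', 'b', 'b', 'c', 'a'])

# np.unique returns the *sorted* uniques, the position of each one's first
# occurrence, and an inverse mapping into the sorted uniques.
_, first_pos, inv_sorted = np.unique(idx.values, return_index=True,
                                     return_inverse=True)

# Reorder the uniques into first-occurrence order and remap the inverse
# accordingly (a double argsort inverts the reordering permutation).
order = np.argsort(first_pos)
inverse = np.argsort(order)[inv_sorted]
unique = idx[np.sort(first_pos)]

reconstruct = unique[inverse]
print(reconstruct.equals(idx))  # True
```

This mirrors the `keep='first'` behavior shown in the whatsnew example: `unique` is `['a', 'b', 'c']` and `inverse` is `[0, 1, 1, 2, 0]`.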
 
 .. _whatsnew_0240.enhancements.other:
 
 Other Enhancements
@@ -770,7 +770,7 @@ def _value_counts_arraylike(values, dropna):
     return keys, counts
 
 
-def duplicated(values, keep='first'):
+def duplicated(values, keep='first', return_inverse=False):
     """
     Return boolean ndarray denoting duplicate values.
 
@@ -785,16 +785,67 @@ def duplicated(values, keep='first'):
           occurrence.
         - ``last`` : Mark duplicates as ``True`` except for the last
           occurrence.
-        - False : Mark all duplicates as ``True``.
+        - False : Mark all duplicates as ``True``. This option is not
+          compatible with ``return_inverse``.
+    return_inverse : boolean, default False
+        If True, also return the selection of (integer) indices from the
+        array of unique values (created e.g. by selecting the boolean
+        complement of the first output, or by using ``.drop_duplicates``
+        with the same ``keep`` parameter) that can be used to reconstruct
+        "values".
+
+        .. versionadded:: 0.24.0
 
     Returns
     -------
-    duplicated : ndarray
+    duplicated : ndarray or tuple of ndarray if ``return_inverse`` is True
     """
 
+    if return_inverse and keep is False:
+        raise ValueError("The parameters return_inverse=True and "
+                         "keep=False cannot be used together (impossible "
+                         "to calculate an inverse when discarding all "
+                         "instances of a duplicate).")
+
     values, dtype, ndtype = _ensure_data(values)
     f = getattr(htable, "duplicated_{dtype}".format(dtype=ndtype))
-    return f(values, keep=keep)
+    isdup = f(values, keep=keep)
+    if not return_inverse:
+        return isdup
+    elif not isdup.any():
+        # no need to calculate inverse if no duplicates
+        inv = np.arange(len(values))
+        return isdup, inv

[Review comment] Is this always going to hold true? For example, if we work with a Series that is not sequentially indexed starting at 0 but doesn't contain duplicates, is this going to return the appropriate result?
[Reply] This is just the base version of …
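The reviewer's concern about non-default indexes can be checked in isolation. This is a standalone sketch using only released numpy/pandas: the helper above works on positions, and mapping positions back to index labels is the job of the Series-level wrapper (not shown in this diff):

```python
import numpy as np
import pandas as pd

# A Series whose index is neither sequential nor zero-based, with no duplicates.
s = pd.Series([10, 20, 30], index=[7, 3, 99])

isdup = s.duplicated().values
assert not isdup.any()

# The identity inverse from the early-return branch refers to *positions*,
# not labels, so it reconstructs the values regardless of the index.
inv = np.arange(len(s))
print((s.values[inv] == s.values).all())  # True
```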
 
+    if keep == 'first':
+        # o2u: original indices to indices of ARRAY of unique values
+        # u2o: reduplication from array of unique values to original array

[Review comment] I'd rather you not use … Further, please use actual names here, and really avoid using abbreviations in any library code.
[Reply] Yes, it would be nicer to have this implemented in the cython hashtable functions, but that performance improvement is for a follow-up.
[Reply] There's a fully commented, thoroughly explained and very localized part where these appear. Not sure how this is unclear, but will adapt...
[Reply] Finally, the core changes are minimal (but I understand that it looks like a lot). TL;DR: the implementation moves to …

+        # this fits together in the way that values[o2u] are the unique values
+        # and values[o2u][u2o] == values
+        _, o2u, u2o = np.unique(values, return_index=True,
+                                return_inverse=True)
+    elif keep == 'last':
+        # np.unique takes the first occurrence as the unique value,
+        # so we flip the values so that the first occurrence becomes the last
+        values = values[::-1]
+        _, o2u, u2o = np.unique(values, return_index=True,
+                                return_inverse=True)
+        # the values in "values" correspond(ed) to the index of "values",
+        # which is simply np.arange(len(values)).
+        # By flipping "values" around, we need to do the same for the index,
+        # because o2u and u2o are relative to that order.
+        # Finally, to fit with the original order again, we need to flip the
+        # result around one last time.
+        o2u, u2o = np.arange(len(values))[::-1][o2u], u2o[::-1]
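The flip trick in the `keep='last'` branch can be verified on its own with plain numpy; this is a sketch mirroring the code above, not the PR code itself:

```python
import numpy as np

values = np.array([0, 1, 1, 2, 0])

# Flip so each value's last occurrence becomes the first one np.unique keeps.
flipped = values[::-1]
_, o2u_f, u2o_f = np.unique(flipped, return_index=True, return_inverse=True)

# o2u_f indexes into the flipped array; translate it back to original
# positions, and flip the inverse back into original order.
o2u = np.arange(len(values))[::-1][o2u_f]
u2o = u2o_f[::-1]

print(o2u)               # [4 2 3]: positions of the *last* occurrences
print(values[o2u][u2o])  # [0 1 1 2 0]: round-trips back to values
```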
 
+    # np.unique yields a sorted list of uniques, and o2u/u2o are relative
+    # to this order. To restore the original order, we argsort o2u, because
+    # o2u would be ordered if np.unique had not sorted implicitly. The first
+    # argsort gives the permutation from o2u to its sorted form, but we need
+    # the inverse permutation (the map from the unsorted uniques to o2u, from
+    # which we can continue with u2o). This inversion (as a permutation) is
+    # achieved by the second argsort.
+    inv = np.argsort(np.argsort(o2u))[u2o]
+    return isdup, inv
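The double argsort in the line above is the standard trick for inverting a permutation, i.e. computing the rank of each entry. A minimal standalone check (the `o2u` values here are made up for illustration):

```python
import numpy as np

o2u = np.array([4, 2, 3])  # hypothetical first-occurrence positions

perm = np.argsort(o2u)     # permutation that would sort o2u
rank = np.argsort(perm)    # its inverse: the rank of each entry within o2u

print(rank)  # [2 0 1]: 4 is the largest, 2 the smallest, 3 in the middle
```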
 
 
 def mode(values, dropna=True):
[Review comment] Use double backticks where you now use single backticks (also on the lines below). In rst, double backticks give code-styled text.