Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add ignore_index for df.drop_duplicates #30405

Merged
merged 15 commits into from
Dec 27, 2019

Conversation

charlesdong1991
Copy link
Member

@charlesdong1991 charlesdong1991 commented Dec 22, 2019

@gfyoung gfyoung added API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Dec 22, 2019
@@ -4606,6 +4607,8 @@ def drop_duplicates(
- False : Drop all duplicates.
inplace : bool, default False
Whether to drop duplicates in place or to return a copy.
ignore_index : bool, default False
If True, the resulting axis will be labeled 0, …, n - 1.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If True, the resulting axis will be labeled 0, …, n - 1.
If True, the resulting axis will be labeled 0, 1, …, n - 1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, changed!

self._update_inplace(new_data)
else:
if ignore_index:
idx = ibase.default_index(len(self[-duplicated]))
Copy link
Member

@WillAyd WillAyd Dec 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this block could be more succinctly written as:

result = self[~duplicated]
if ignore_index:
    result = result.reset_index(drop=True)

return result

Or something similar. FWIW I think the current method of evaluating self[~duplicated] twice can be costly for larger frames

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh, good call! @WillAyd

will change! i think idx = ibase.default_index(sum(-duplicated)) should be also faster than using self[-duplicated]

@@ -474,7 +474,7 @@ Other API changes
Supplying anything else than ``how`` to ``**kwargs`` raised a ``TypeError`` previously (:issue:`29388`)
- When testing pandas, the new minimum required version of pytest is 5.0.1 (:issue:`29664`)
- :meth:`Series.str.__iter__` was deprecated and will be removed in future releases (:issue:`28277`).

- Add ``ignore_index`` to :meth:`DataFrame.drop_duplicates` to reset index (:issue:`30114`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reverse the ordering here, e.g. .drop_duplicates has gained the ignore_index keyword.

move to other enhancements

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved and rephrased!

pandas/core/frame.py Show resolved Hide resolved
@@ -4621,9 +4624,15 @@ def drop_duplicates(
if inplace:
(inds,) = (-duplicated)._ndarray_values.nonzero()
new_data = self._data.take(inds)

if ignore_index:
new_data = new_data.reset_index(drop=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use reset_index, simply use what you did in the other PR, default_index. .reset_index() causes another copy here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, changed!

return self[-duplicated]
result = self[-duplicated]
if ignore_index:
return result.reset_index(drop=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here you have to copy before you update the index

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i did not use copy, but .index assignment, and i think it is also correct.

pandas/tests/frame/test_duplicates.py Outdated Show resolved Hide resolved
result = self[-duplicated]

if ignore_index:
result.index = ibase.default_index(sum(-duplicated))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

result.index = ibase.default_index(len(result))

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

tm.assert_frame_equal(df, DataFrame(origin_dict))


@pytest.mark.parametrize(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you incorporate these in the above test as well.

can you also assert if inplace== True that the input is unchanged (copy it before and check equality)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

incorporated, and added test

thanks @jreback

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@jreback jreback added this to the 1.0 milestone Dec 27, 2019
@jreback jreback merged commit 7025c59 into pandas-dev:master Dec 27, 2019
@jreback
Copy link
Contributor

jreback commented Dec 27, 2019

thanks @charlesdong1991

AlexKirko pushed a commit to AlexKirko/pandas that referenced this pull request Dec 29, 2019
keechongtan added a commit to keechongtan/pandas that referenced this pull request Dec 29, 2019
…ndexing-1row-df

* upstream/master: (333 commits)
  CI: troubleshoot Web_and_Docs failing (pandas-dev#30534)
  WARN: Ignore NumbaPerformanceWarning in test suite (pandas-dev#30525)
  DEPR: camelCase in offsets, get_offset (pandas-dev#30340)
  PERF: implement scalar ops blockwise (pandas-dev#29853)
  DEPR: Remove Series.compress (pandas-dev#30514)
  ENH: Add numba engine for rolling apply (pandas-dev#30151)
  [ENH] Add to_markdown method (pandas-dev#30350)
  DEPR: Deprecate pandas.np module (pandas-dev#30386)
  ENH: Add ignore_index for df.drop_duplicates (pandas-dev#30405)
  BUG: The setting xrot=0 in DataFrame.hist() doesn't work with by and subplots pandas-dev#30288 (pandas-dev#30491)
  CI: Fix GBQ Tests (pandas-dev#30478)
  Bug groupby quantile listlike q and int columns (pandas-dev#30485)
  ENH: Add ignore_index for df.sort_values and series.sort_values (pandas-dev#30402)
  TYP: Typing hints in pandas/io/formats/{css,csvs}.py (pandas-dev#30398)
  BUG: raise on non-hashable Index name, closes pandas-dev#29069 (pandas-dev#30335)
  Replace "foo!r" to "repr(foo)" syntax pandas-dev#29886 (pandas-dev#30502)
  BUG: preserve EA dtype in transpose (pandas-dev#30091)
  BLD: add check to prevent tempita name error, clsoes pandas-dev#28836 (pandas-dev#30498)
  REF/TST: method-specific files for test_append (pandas-dev#30503)
  marked unused parameters (pandas-dev#30504)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants