Add new function to remove duplicate rows from a DataFrame #319

Closed
wesm opened this issue Nov 1, 2011 · 3 comments

@wesm
Member

wesm commented Nov 1, 2011

This should be reasonably performant; probably just use sets + np.apply_along_axis.
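A minimal sketch of the set-based idea, for illustration only. It assumes the key columns hold hashable values, and it walks the rows with a plain loop over tuples rather than np.apply_along_axis. The function name `drop_duplicate_rows` is hypothetical and is not part of pandas.

```python
import numpy as np
import pandas as pd


def drop_duplicate_rows(df, keys):
    """Keep the first row for each distinct combination of the key columns.

    Hypothetical helper for illustration; not a pandas API.
    """
    seen = set()
    keep = []
    # Each row of the key columns becomes a plain tuple, which is hashable
    # as long as the individual values are hashable.
    for row in df[keys].itertuples(index=False, name=None):
        if row in seen:
            keep.append(False)
        else:
            seen.add(row)
            keep.append(True)
    # Boolean mask selection preserves the original row order.
    return df.loc[np.array(keep, dtype=bool)]


example = pd.DataFrame(
    {
        "key1": ["a", "a", "b", "a"],
        "key2": [1, 1, 2, 3],
        "value": [10, 11, 12, 13],
    }
)
deduped = drop_duplicate_rows(example, ["key1", "key2"])
```

Here the second row repeats the key pair ("a", 1), so it is the only row dropped.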

@seanjtaylor

Not sure whether this should be 1) a dedicated DataFrame method, 2) a method on DataFrameGroupBy, or 3) a call to aggregate with a specific function.

# 1: dedicated DataFrame method
df = DataFrame()
new_df = df.remove_duplicates(('key1', 'key2'))

# 2: method on DataFrameGroupBy
# Take only the first row for each group, and make sure all other rows
# with this key have the same values.
new_df = df.groupby(('key1', 'key2')).first(check_identical=True)

# 3: call to aggregate with a specific function
# first_row is some function that takes the first row for each group.
new_df = df.groupby(('key1', 'key2')).aggregate(first_row)
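For illustration, option 3 can be approximated with calls that exist in pandas today. This is only a sketch: `first_row` is a hypothetical helper, the sample data is made up, and the "check identical" idea from option 2 is expressed here as a separate `nunique` check, since `check_identical` is a proposed parameter rather than an existing one.

```python
import pandas as pd


def first_row(column):
    # Hypothetical helper: pandas calls this once per column per group,
    # so returning the first value yields the first row of each group.
    return column.iloc[0]


df = pd.DataFrame(
    {
        "key1": ["a", "a", "b"],
        "key2": [1, 1, 2],
        "value": [10, 10, 20],
    }
)
keys = ["key1", "key2"]

# One row per key combination; the keys become the index of the result.
deduped = df.groupby(keys).aggregate(first_row)

# Stand-in for the proposed check_identical flag: every non-key column
# should have at most one distinct value within each group.
all_identical = bool((df.groupby(keys).nunique() <= 1).all().all())
```

Note that the result is indexed by the key columns rather than by the original row labels, which is one practical difference from option 1.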

@wesm
Member Author

wesm commented Nov 1, 2011

Here's an easy and fast implementation using existing tools:

grouped = df.groupby(keys)
index = [gp_keys[0] for gp_keys in grouped.groups.values()]
new_df = df.reindex(index)

It may be possible to go even faster; I'll play around with it.
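The snippet above can be tried end to end as follows. This is a sketch under stated assumptions: the sample frame and the `keys` list are made up, and the frame's index must be unique for `reindex` to work. The comparison with `drop_duplicates` is only there as a sanity check against the method pandas provides today.

```python
import pandas as pd

# Made-up sample data with two duplicated key pairs.
df = pd.DataFrame(
    {
        "key1": ["a", "a", "b", "b", "a"],
        "key2": [1, 1, 2, 2, 3],
        "value": [10, 11, 12, 13, 14],
    }
)
keys = ["key1", "key2"]

grouped = df.groupby(keys)
# grouped.groups maps each key combination to the index labels of its rows;
# taking element 0 gives the label of the first row in each group.
first_labels = [labels[0] for labels in grouped.groups.values()]
new_df = df.reindex(first_labels)

# Rows come back in group order, so sort by label to restore input order.
new_df = new_df.sort_index()

# Sanity check against the built-in method.
expected = df.drop_duplicates(subset=keys)
```

With this data, `new_df` and `expected` both contain the rows labelled 0, 2 and 4.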

@seanjtaylor

This is exactly what I was looking for. And I learned something useful about pandas internals today.

wesm added a commit that referenced this issue Nov 7, 2011
wesm closed this as completed Nov 7, 2011
dan-nadler pushed a commit to dan-nadler/pandas that referenced this issue Sep 23, 2019