set_index drops data in the presence of duplicates when inplace=True and verify_integrity=True #1831

Closed
snth opened this issue Aug 31, 2012 · 0 comments
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves

snth commented Aug 31, 2012

When calling set_index with key columns that contain duplicates, verify_integrity=True correctly detects the duplicates, but when inplace=True is also passed the check appears to run after the key columns have already been dropped from the frame. The exception is raised, but the data in those columns is lost.

I believe it would be better if the original DataFrame object were modified only when the set_index operation succeeds.

Code to reproduce the problem:

In [189]: df = DataFrame({'one':[1, 1, 2], 'two':[1,2,3]})

In [190]: df
Out[190]: 
   one  two
0    1    1
1    1    2
2    2    3

In [191]: df.set_index(['one'], inplace=True, verify_integrity=True)
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
/mnt/hgfs/fastdata/<ipython-input-191-e1c0e8c92f6c> in <module>()
----> 1 df.set_index(['one'], inplace=True, verify_integrity=True)

/home/tobias/code/envs/mac/local/lib/python2.7/site-packages/pandas/core/frame.pyc in set_index(self, keys, drop, append, inplace, verify_integrity)
   2328         if verify_integrity and not index.is_unique:
   2329             duplicates = index.get_duplicates()
-> 2330             raise Exception('Index has duplicate keys: %s' % duplicates)
   2331 
   2332         # clear up memory usage


Exception: Index has duplicate keys: [1]

In [192]: df
Out[192]: 
   two
0    1
1    2
2    3

In [202]: print sys.version
2.7.3 (default, Aug  1 2012, 05:14:39) 
[GCC 4.6.3]

In [203]: print pd.version.version
0.8.1

In [204]: 
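
One possible workaround, shown here only as a minimal sketch and not as the eventual fix in pandas: drop inplace=True and rebind the name to the returned frame. The non-inplace call builds a new object, so the original df should be left untouched if the integrity check fails. The broad `except Exception` is an assumption made to cover both the plain Exception shown in the traceback above and the ValueError that more recent pandas versions raise.

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 1, 2], 'two': [1, 2, 3]})

try:
    # Non-inplace call: the result is a new object, so df is only
    # rebound if set_index (including the duplicate check) succeeds.
    df = df.set_index(['one'], verify_integrity=True)
except Exception as exc:
    # Plain Exception in 0.8.1 (see traceback above); ValueError in
    # recent pandas versions. Caught broadly here to cover both.
    print('set_index failed: %s' % exc)

# Column 'one' is still present because df was never modified.
print(list(df.columns))
```

With this pattern the failed call costs a temporary copy, but the column data survives the exception.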