DataFrame.drop_duplicates works only with a literal list of column names, but fails when used on the output of DataFrame.columns #1773

Closed
spearsem opened this issue Aug 16, 2012 · 1 comment
@spearsem

DataFrame.drop_duplicates() raises an AssertionError when it is passed the object returned by DataFrame.columns, and also when it is passed the NumPy array from DataFrame.columns.values. If you compute

list(DataFrame.columns.values)

and pass that instead, it works, but the extra conversion should not be necessary, especially when dealing with a large number of columns. Below is an example from IPython.

In [71]: dfrm = pandas.DataFrame({"A":[1,2,1,2,1,2], "B":[3,4,3,4,3,4], "C":[1,2,1,2,1,3]})

In [72]: dfrm
Out[72]:
   A  B  C
0  1  3  1
1  2  4  2
2  1  3  1
3  2  4  2
4  1  3  1
5  2  4  3

In [73]: dfrm.drop_duplicates(dfrm.columns)
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (882, 0))
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (6442, 0))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/home/espears/<ipython-input-73-bee9ee352073> in <module>()
----> 1 dfrm.drop_duplicates(dfrm.columns)

/opt/epd/7.2-1/lib/python2.7/site-packages/pandas/core/frame.pyc in drop_duplicates(self, cols, take_last)
   2254         deduplicated : DataFrame
   2255         """
-> 2256         duplicated = self.duplicated(cols, take_last=take_last)
   2257         return self[-duplicated]
   2258

/opt/epd/7.2-1/lib/python2.7/site-packages/pandas/core/frame.pyc in duplicated(self, cols, take_last)
   2283
   2284         duplicated = lib.duplicated(keys, take_last=take_last)
-> 2285         return Series(duplicated, index=self.index)
   2286
   2287     #----------------------------------------------------------------------


/opt/epd/7.2-1/lib/python2.7/site-packages/pandas/core/series.pyc in __new__(cls, data, index, dtype, name, copy)
    286         else:
    287             subarr = subarr.view(Series)
--> 288         subarr.index = index
    289         subarr.name = name
    290

/opt/epd/7.2-1/lib/python2.7/site-packages/pandas/_tseries.so in pandas._tseries.SeriesIndex.__set__ (pandas/src/tseries.c:73097)()

AssertionError: Index length did not match values

In [74]: dfrm.drop_duplicates(dfrm.columns.values)
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (882, 0))
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (6442, 0))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/home/espears/<ipython-input-74-cb96df701a9b> in <module>()
----> 1 dfrm.drop_duplicates(dfrm.columns.values)

/opt/epd/7.2-1/lib/python2.7/site-packages/pandas/core/frame.pyc in drop_duplicates(self, cols, take_last)
   2254         deduplicated : DataFrame
   2255         """
-> 2256         duplicated = self.duplicated(cols, take_last=take_last)
   2257         return self[-duplicated]
   2258

/opt/epd/7.2-1/lib/python2.7/site-packages/pandas/core/frame.pyc in duplicated(self, cols, take_last)
   2283
   2284         duplicated = lib.duplicated(keys, take_last=take_last)
-> 2285         return Series(duplicated, index=self.index)
   2286
   2287     #----------------------------------------------------------------------


/opt/epd/7.2-1/lib/python2.7/site-packages/pandas/core/series.pyc in __new__(cls, data, index, dtype, name, copy)
    286         else:
    287             subarr = subarr.view(Series)
--> 288         subarr.index = index
    289         subarr.name = name
    290

/opt/epd/7.2-1/lib/python2.7/site-packages/pandas/_tseries.so in pandas._tseries.SeriesIndex.__set__ (pandas/src/tseries.c:73097)()

AssertionError: Index length did not match values

In [75]: dfrm.columns.values
Out[75]: array([A, B, C], dtype=object)

In [76]: list(dfrm.columns.values)
Out[76]: ['A', 'B', 'C']

In [77]: dfrm.drop_duplicates(list(dfrm.columns.values))
Out[77]:
   A  B  C
0  1  3  1
1  2  4  2
5  2  4  3

FWIW:

In [91]: pandas.__version__
Out[91]: '0.7.3'
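
Until a fix is available, the list conversion shown above can be wrapped in a small helper. This is only a sketch: the name `as_label_list` is hypothetical and not part of pandas, and it is written for Python 3 rather than the Python 2.7 session shown in this report.

```python
def as_label_list(cols):
    """Return column labels as a plain list (hypothetical helper, not a pandas API).

    Accepts None, a single label, or any iterable of labels such as an
    Index, a NumPy array, or a tuple.
    """
    if cols is None:
        # None conventionally means "use all columns"; pass it through.
        return None
    if isinstance(cols, str):
        # A string is iterable, but here it is treated as one label.
        return [cols]
    try:
        return list(cols)
    except TypeError:
        # Not iterable: treat it as a single non-string label.
        return [cols]
```

On an affected version the call would then look like `dfrm.drop_duplicates(as_label_list(dfrm.columns))`, which is equivalent to the `list(dfrm.columns.values)` workaround in In [77].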
@lodagro
Contributor

lodagro commented Sep 4, 2012

Fixed. Passing dfrm.columns directly now works:

In [2]: dfrm = pandas.DataFrame({"A":[1,2,1,2,1,2], "B":[3,4,3,4,3,4], "C":[1,2,1,2,1,3]})

In [3]: dfrm.drop_duplicates(dfrm.columns)
Out[3]: 
   A  B  C
0  1  3  1
1  2  4  2
5  2  4  3
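
For anyone who wants to confirm the behavior on their own install, here is a minimal check. It assumes a recent pandas, where the argument is named `subset`; the `cols` name in the tracebacks above comes from 0.7.3. It compares the three ways of passing the column labels from this report.

```python
import pandas as pd

dfrm = pd.DataFrame({"A": [1, 2, 1, 2, 1, 2],
                     "B": [3, 4, 3, 4, 3, 4],
                     "C": [1, 2, 1, 2, 1, 3]})

# Same labels passed as a plain list, as the Index, and as a NumPy array.
from_list = dfrm.drop_duplicates(subset=list(dfrm.columns))
from_index = dfrm.drop_duplicates(subset=dfrm.columns)
from_array = dfrm.drop_duplicates(subset=dfrm.columns.values)

# All three calls should keep the same rows.
assert from_list.equals(from_index)
assert from_list.equals(from_array)

print(from_list)
```

Note that `list(dfrm.columns)` is enough for the list form; going through `.values` first is not required.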

@lodagro closed this as completed in cfe674e on Sep 4, 2012