DataFrame.drop_duplicates works literally only with list of column names, but fails when used on output of DataFrame.columns #1773

spearsem · 2012-08-16T13:00:45Z

DataFrame.drop_duplicates() fails when it is passed the Index object returned by DataFrame.columns, and it also fails when passed the NumPy array from DataFrame.columns.values. Converting to a plain list first,

list(DataFrame.columns.values)

works, but the conversion should not be necessary, especially when dealing with a large number of columns. Below is an example from IPython.

In [71]: dfrm = pandas.DataFrame({"A":[1,2,1,2,1,2], "B":[3,4,3,4,3,4], "C":[1,2,1,2,1,3]})

In [72]: dfrm
Out[72]:
   A  B  C
0  1  3  1
1  2  4  2
2  1  3  1
3  2  4  2
4  1  3  1
5  2  4  3

In [73]: dfrm.drop_duplicates(dfrm.columns)
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (882, 0))
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (6442, 0))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/home/espears/<ipython-input-73-bee9ee352073> in <module>()
----> 1 dfrm.drop_duplicates(dfrm.columns)

/opt/epd/7.2-1/lib/python2.7/site-packages/pandas/core/frame.pyc in drop_duplicates(self, cols, take_last)
   2254         deduplicated : DataFrame
   2255         """
-> 2256         duplicated = self.duplicated(cols, take_last=take_last)
   2257         return self[-duplicated]
   2258

/opt/epd/7.2-1/lib/python2.7/site-packages/pandas/core/frame.pyc in duplicated(self, cols, take_last)
   2283
   2284         duplicated = lib.duplicated(keys, take_last=take_last)
-> 2285         return Series(duplicated, index=self.index)
   2286
   2287     #----------------------------------------------------------------------


/opt/epd/7.2-1/lib/python2.7/site-packages/pandas/core/series.pyc in __new__(cls, data, index, dtype, name, copy)
    286         else:
    287             subarr = subarr.view(Series)
--> 288         subarr.index = index
    289         subarr.name = name
    290

/opt/epd/7.2-1/lib/python2.7/site-packages/pandas/_tseries.so in pandas._tseries.SeriesIndex.__set__ (pandas/src/tseries.c:73097)()

AssertionError: Index length did not match values

In [74]: dfrm.drop_duplicates(dfrm.columns.values)
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (882, 0))
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (6442, 0))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
/home/espears/<ipython-input-74-cb96df701a9b> in <module>()
----> 1 dfrm.drop_duplicates(dfrm.columns.values)

/opt/epd/7.2-1/lib/python2.7/site-packages/pandas/core/frame.pyc in drop_duplicates(self, cols, take_last)
   2254         deduplicated : DataFrame
   2255         """
-> 2256         duplicated = self.duplicated(cols, take_last=take_last)
   2257         return self[-duplicated]
   2258

/opt/epd/7.2-1/lib/python2.7/site-packages/pandas/core/frame.pyc in duplicated(self, cols, take_last)
   2283
   2284         duplicated = lib.duplicated(keys, take_last=take_last)
-> 2285         return Series(duplicated, index=self.index)
   2286
   2287     #----------------------------------------------------------------------


/opt/epd/7.2-1/lib/python2.7/site-packages/pandas/core/series.pyc in __new__(cls, data, index, dtype, name, copy)
    286         else:
    287             subarr = subarr.view(Series)
--> 288         subarr.index = index
    289         subarr.name = name
    290

/opt/epd/7.2-1/lib/python2.7/site-packages/pandas/_tseries.so in pandas._tseries.SeriesIndex.__set__ (pandas/src/tseries.c:73097)()

AssertionError: Index length did not match values

In [75]: dfrm.columns.values
Out[75]: array([A, B, C], dtype=object)

In [76]: list(dfrm.columns.values)
Out[76]: ['A', 'B', 'C']

In [77]: dfrm.drop_duplicates(list(dfrm.columns.values))
Out[77]:
   A  B  C
0  1  3  1
1  2  4  2
5  2  4  3

FWIW:

In [91]: pandas.__version__
Out[91]: '0.7.3'
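
To make the expected behavior concrete, here is a minimal pure-Python sketch of "drop rows that repeat an earlier row on the given columns". It is not pandas code and makes no claim about how pandas implements this; the helper name `drop_duplicate_rows` and the list-of-dicts row layout are assumptions made for illustration. The point is only that the column argument can be any iterable of names, not just a list, and still give the rows 0, 1, 5 shown in Out[77].

```python
def drop_duplicate_rows(rows, cols, take_last=False):
    """Return (index, row) pairs for rows whose values in `cols` are not repeats.

    `rows` is a list of dicts; `cols` may be any iterable of column names.
    Hypothetical helper for illustration only, not part of pandas.
    """
    cols = list(cols)  # normalise lists, tuples, dict views, array-likes
    indexed = list(enumerate(rows))
    if take_last:
        indexed.reverse()  # scan from the end so the last occurrence is kept
    seen = set()
    kept = []
    for i, row in indexed:
        key = tuple(row[c] for c in cols)  # the row's values on the chosen columns
        if key not in seen:
            seen.add(key)
            kept.append((i, row))
    kept.sort(key=lambda pair: pair[0])  # restore original row order
    return kept


# Same data as the DataFrame in the report above.
data = {"A": [1, 2, 1, 2, 1, 2], "B": [3, 4, 3, 4, 3, 4], "C": [1, 2, 1, 2, 1, 3]}
rows = [dict(zip(data, values)) for values in zip(*data.values())]

# A list and a non-list iterable of the same names select the same rows.
from_list = drop_duplicate_rows(rows, ["A", "B", "C"])
from_keys = drop_duplicate_rows(rows, data.keys())
print([i for i, _ in from_list])  # [0, 1, 5]
```

Normalising the argument once with `list(cols)` inside the function is the design choice being illustrated: the caller never has to convert an array-like into a list by hand.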

lodagro · 2012-09-04T20:28:02Z

fixed

In [2]: dfrm = pandas.DataFrame({"A":[1,2,1,2,1,2], "B":[3,4,3,4,3,4], "C":[1,2,1,2,1,3]})

In [3]: dfrm.drop_duplicates(dfrm.columns)
Out[3]: 
   A  B  C
0  1  3  1
1  2  4  2
5  2  4  3

lodagro closed this as completed in cfe674e Sep 4, 2012
