
duplicated removing items that are not duplicates in 0.18 #14443

Closed
johnml1135 opened this issue Oct 17, 2016 · 3 comments

johnml1135 commented Oct 17, 2016

I wish I could say more about this, but it involves proprietary data. Basically, I have a DataFrame of 10 columns and around 7000 rows. When I call duplicated on 4 of the columns (3 hold strings, 1 holds a float), around 10 rows (5 pairs) get flagged as duplicates when in fact they are different.

Here is a sample of the rows being flagged:

          Type   Line                                       LineString       Parameter  Filename
1885      else  832.0              (&temp32)->byte1 = (&temp32)->byte4          temp32   arinc.c
1895  do while  832.0                                      do while(0)                   arinc.c
4515      enum  122.0  enum {QQ_ON_UNASSERTED = 0, QQ_ON_ASSERTED = 1}  QQ_ON_ASSERTED  eeprom.c
4521      enum  167.0        enum {FIELD_RESET = 1, FIELD_TRIPPED = 0}   FIELD_TRIPPED  eeprom.c

I know that duplicates are checked through hashing, but is there some way to compare a checksum or some more robust measure, to ensure that only true duplicates are flagged? What is the chance that two items could hash to the same value?
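
One way to sanity-check the flags (a minimal sketch on a toy frame, since the real data is proprietary and all values here are invented) is to call duplicated with keep=False, which marks every member of a duplicate group so each flagged row can be compared against its match:

    import pandas as pd

    # Toy frame standing in for the proprietary data; the values are made up
    df = pd.DataFrame({
        "Type":       ["else", "else", "enum", "enum"],
        "Line":       [832.0, 832.0, 122.0, 167.0],
        "LineString": ["a = b", "a = b", "enum {X}", "enum {Y}"],
        "Parameter":  ["temp32", "temp32", "X", "Y"],
        "Filename":   ["arinc.c", "eeprom.c", "arinc.c", "eeprom.c"],
    })
    dup_cols = ["Type", "Line", "LineString", "Parameter"]

    # keep=False flags every row of a duplicate group (not only the later
    # occurrences), so sorting places each flagged row next to its match
    flagged = df[df.duplicated(subset=dup_cols, keep=False)].sort_values(dup_cols)
    print(flagged)  # rows 0 and 1 agree on all four key columns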

By adding more fields to the subset being checked, I am able to prevent the false duplicate flags.
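
Continuing the toy sketch above: with one more column in the subset, two rows that agreed on the original four fields only by coincidence stop being flagged.

    # Widening the key: rows 0 and 1 differ on Filename, so adding it to
    # the subset removes the (apparent) duplicate flag entirely
    wider_cols = dup_cols + ["Filename"]
    print(df.duplicated(subset=wider_cols).any())  # False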

I am using Pandas 0.18 from WinPython-64bit-3.4.4.2Qt5.

jreback (Contributor) commented Oct 17, 2016

Show what you are actually calling, as well as df.info().

johnml1135 (Author) commented

Here are the lines of code that I run:

    dup_cols = ["Type","Line","LineString","Parameter"]
    pdata3 = pdata2.drop_duplicates(subset=dup_cols)

and here is the result of pdata3.info():

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7120 entries, 0 to 20578
Data columns (total 17 columns):
Block              7120 non-null object
Blocks             7120 non-null object
Filename           7120 non-null object
Impacts            7120 non-null object
ImpactsUnique      7120 non-null object
Level              7120 non-null int64
Line               7120 non-null float64
LineString         7120 non-null object
OwningFunction     7120 non-null object
Parameter          7120 non-null object
ParameterType      7120 non-null object
ParameterUnique    7120 non-null object
Type               7120 non-null object
ParameterFull      7120 non-null object
ParameterField     7120 non-null object
ImpactsFull        7120 non-null object
ImpactsField       7120 non-null object
dtypes: float64(1), int64(1), object(15)
memory usage: 1001.2+ KB

johnml1135 (Author) commented

I have been stepping through the pandas code and found that (predictably) it was my own mistake. There were duplicates; I was just not seeing them. You can close this ticket.
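
For anyone who lands here with the same confusion, a quick way to see exactly what drop_duplicates removed (a sketch reusing the pdata2/pdata3 frames and dup_cols from the snippet above) is to diff the two indexes, or to count rows per key:

    # Rows present in pdata2 but absent from pdata3 are the dropped duplicates
    dropped = pdata2.loc[pdata2.index.difference(pdata3.index)]
    print(dropped[dup_cols])

    # Any key combination occurring more than once is a genuine duplicate group
    counts = pdata2.groupby(dup_cols).size()
    print(counts[counts > 1])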
