
duplicated removing items that are not duplicates in 0.18 #14443

Closed
johnml1135 opened this issue Oct 17, 2016 · 3 comments

johnml1135 commented Oct 17, 2016

I wish I could say more about this, but it involves proprietary data. Basically, I have a DataFrame of 10 columns and around 7000 rows. When I call duplicated on 4 of the columns (3 hold strings, 1 holds a float), around 10 rows (5 pairs) get flagged as duplicates when in fact they are different.

Here is a sample of the rows being flagged:

          Type   Line                                       LineString       Parameter  Filename
1885      else  832.0              (&temp32)->byte1 = (&temp32)->byte4          temp32   arinc.c
1895  do while  832.0                                      do while(0)                   arinc.c
4515      enum  122.0  enum {QQ_ON_UNASSERTED = 0, QQ_ON_ASSERTED = 1}  QQ_ON_ASSERTED  eeprom.c
4521      enum  167.0        enum {FIELD_RESET = 1, FIELD_TRIPPED = 0}   FIELD_TRIPPED  eeprom.c

I know that duplicates are checked through hashing, but is there some way to compare a checksum or some more robust measure, to ensure that only true duplicates are flagged? What is the chance that two items could hash to the same value?
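
One way to sanity-check the flags (a minimal sketch on a toy frame, since the real data is proprietary and all values here are invented) is to call duplicated with keep=False, which marks every member of a duplicate group so each flagged row can be compared against its match:

    import pandas as pd

    # Toy frame standing in for the proprietary data; the values are made up
    df = pd.DataFrame({
        "Type":       ["else", "else", "enum", "enum"],
        "Line":       [832.0, 832.0, 122.0, 167.0],
        "LineString": ["a = b", "a = b", "enum {X}", "enum {Y}"],
        "Parameter":  ["temp32", "temp32", "X", "Y"],
        "Filename":   ["arinc.c", "eeprom.c", "arinc.c", "eeprom.c"],
    })
    dup_cols = ["Type", "Line", "LineString", "Parameter"]

    # keep=False flags every row of a duplicate group (not only the later
    # occurrences), so sorting places each flagged row next to its match
    flagged = df[df.duplicated(subset=dup_cols, keep=False)].sort_values(dup_cols)
    print(flagged)  # rows 0 and 1 agree on all four key columns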

By adding more fields to the subset being checked, I am able to prevent the false duplicate flags.
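
Continuing the toy sketch above: with one more column in the subset, two rows that agreed on the original four fields only by coincidence stop being flagged.

    # Widening the key: rows 0 and 1 differ on Filename, so adding it to
    # the subset removes the (apparent) duplicate flag entirely
    wider_cols = dup_cols + ["Filename"]
    print(df.duplicated(subset=wider_cols).any())  # False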

I am using Pandas 0.18 from WinPython-64bit-3.4.4.2Qt5.

jreback (Contributor) commented Oct 17, 2016

Show what you are actually calling, as well as df.info().

johnml1135 (Author) commented

Here are the lines of code that I run:

    dup_cols = ["Type","Line","LineString","Parameter"]
    pdata3 = pdata2.drop_duplicates(subset=dup_cols)

and here is the result of pdata3.info():

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7120 entries, 0 to 20578
Data columns (total 17 columns):
Block              7120 non-null object
Blocks             7120 non-null object
Filename           7120 non-null object
Impacts            7120 non-null object
ImpactsUnique      7120 non-null object
Level              7120 non-null int64
Line               7120 non-null float64
LineString         7120 non-null object
OwningFunction     7120 non-null object
Parameter          7120 non-null object
ParameterType      7120 non-null object
ParameterUnique    7120 non-null object
Type               7120 non-null object
ParameterFull      7120 non-null object
ParameterField     7120 non-null object
ImpactsFull        7120 non-null object
ImpactsField       7120 non-null object
dtypes: float64(1), int64(1), object(15)
memory usage: 1001.2+ KB

johnml1135 (Author) commented

I have been stepping through the pandas code and found that (predictably) it was my own mistake. There were duplicates; I was just not seeing them. You can close this ticket.
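
For anyone who lands here with the same confusion, a quick way to see exactly what drop_duplicates removed (a sketch reusing the pdata2/pdata3 frames and dup_cols from the snippet above) is to diff the two indexes, or to count rows per key:

    # Rows present in pdata2 but absent from pdata3 are the dropped duplicates
    dropped = pdata2.loc[pdata2.index.difference(pdata3.index)]
    print(dropped[dup_cols])

    # Any key combination occurring more than once is a genuine duplicate group
    counts = pdata2.groupby(dup_cols).size()
    print(counts[counts > 1])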
