Set difference in Pandas #4617

Closed
ghost opened this issue Aug 20, 2013 · 15 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions Enhancement


ghost commented Aug 20, 2013

I posted this question, "set difference for pandas", on SO, but the answer I received produces some loss of precision.

The basic idea is that if you have two dataframes df1, df2, you should be able to get the rows in df1 that are not included in df2. That is, the set difference of df1 and df2.

I think this would be a good feature request in case it's not possible to solve this easily using pandas' current capabilities.
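For readers landing here later, a minimal sketch of the row-wise set difference described above. It is not a solution proposed in this thread: it assumes a pandas version where `DataFrame.merge` accepts `indicator=True`, and the helper name `row_difference` and the sample frames are made up for illustration.

```python
import pandas as pd


def row_difference(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Return the rows of `left` that do not occur as whole rows in `right`.

    Hypothetical helper; both frames are assumed to share the same columns.
    """
    # Deduplicate `right` so the merge cannot multiply rows of `left`.
    marked = left.merge(right.drop_duplicates(), how="left", indicator=True)
    # Rows tagged "left_only" found no matching row in `right`.
    return marked.loc[marked["_merge"] == "left_only", list(left.columns)]


# Illustrative data (assumed, not taken from the original question).
df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [4, 2, 5], "col2": [6, 3, 5]})

result = row_difference(df1, df2)
print(result)
```

The merge matches on all shared columns at once, so a row counts as present only if every column value matches. Floats are compared by exact equality.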

Contributor

hayd commented Aug 20, 2013

related issue re Series object set operations #4480

Contributor

jreback commented Aug 20, 2013

@rsmith31415 maybe your example was a bit unclear, but the 2 solutions look reasonable (and as I said, isin might be a nicer solution); what do you mean by loss of precision?

Author

ghost commented Aug 20, 2013

@jreback When you try Joop's solution on a dataframe containing floating point numbers with more than 5 decimal places, there is a rounding error associated with the set operation.

Jeff's answer doesn't produce the desired result.

Not sure how and when we can use isin. Could you elaborate on that?

Contributor

jreback commented Aug 20, 2013

Maybe you should give a detailed floating point input and desired output. It is not clear what you are after.

Contributor

hayd commented Aug 20, 2013

You nearly want to do:

In [8]: df2[~df2.isin(df1.to_dict(outtype='list')).all(1)]
Out[8]: 
   col1  col2
0     4     6
2     5     5

In [9]: df2[~df2.isin(df1).all(1)]  # soon

But the problem is this just checks whether each element is in each column, not that they appear together. The problem being that these act column-wise. Hmmm...
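A small sketch of that pitfall, using made-up data: the probe row `(1, 4)` is not a row of `df1`, yet each of its values appears somewhere in the corresponding column. Note this uses `to_dict(orient="list")`, the keyword current pandas releases use in place of `outtype`; the tuple-based row test is just one illustrative alternative, not something proposed in the thread.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
probe = pd.DataFrame({"col1": [1], "col2": [4]})  # (1, 4) is not a row of df1

# Column-wise test: 1 occurs in col1 and 4 occurs in col2, so every cell
# matches even though the pair never occurs together in df1.
columnwise = probe.isin(df1.to_dict(orient="list")).all(axis=1)

# Row-wise test: compare whole rows as tuples instead.
rows_of_df1 = set(df1.itertuples(index=False, name=None))
rowwise = [row in rows_of_df1 for row in probe.itertuples(index=False, name=None)]

print(columnwise.tolist(), rowwise)
```

The column-wise check reports the probe row as present, while the row-wise check does not.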

Author

ghost commented Aug 20, 2013

Now that I try to reproduce this issue, I suspect this might be related to the precision of dataframes. For example:

df1 = pd.DataFrame({'col1':[1.5415541549E12,2.144815145649E12,3.1541515119E12], 'col2':[2.154456546519E12,3.1456165165419E12,4.1456165165419E12]})

produces:

0 1.541554e+12 2.154457e+12
1 2.144815e+12 3.145617e+12
2 3.154152e+12 4.145617e+12

Is there any way to increase the precision?

Member

cpcloud commented Aug 20, 2013

that's just the way they're printed....try e.g., pd.set_option('display.precision', 12)

Member

cpcloud commented Aug 20, 2013

pandas would have failed a long time ago if it truncated an arbitrary number of decimals simply by constructing a DataFrame!

Member

cpcloud commented Aug 20, 2013

You can also look at individual values to confirm that they are in fact the same as when you put them in. E.g.,

In [8]: df1
Out[8]:
            col1               col2
0  1541554154900  2.15445654652e+12
1  2144815145649  3.14561651654e+12
2  3154151511900  4.14561651654e+12

In [9]: df1.col2[2]
Out[9]: 4145616516541.8999

In [10]: test_value = 4.1456165165419E12

In [12]: df1.col2[2] == test_value
Out[12]: True
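The same check as a standalone script, as a rough sketch. It assumes `pd.option_context` is available to change `display.precision` temporarily; the variable names are illustrative.

```python
import pandas as pd

values = [2.154456546519e12, 3.1456165165419e12, 4.1456165165419e12]
df1 = pd.DataFrame({"col2": values})

# Display precision only affects how the frame is rendered as text.
with pd.option_context("display.precision", 12):
    print(df1)

# The stored floats are unchanged, whatever the display setting is.
stored = df1["col2"].iloc[2]
print(stored == 4.1456165165419e12)
```

Using the context manager keeps the wider display setting from leaking into the rest of the session.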

Author

ghost commented Aug 20, 2013

@hayd

"But the problem is this just checks whether each element is in each column, not that they appear together. The problem being that these act column-wise. Hmmm..."

Exactly right.

Given @cpcloud's comment, Joon's solution should work well. Nevertheless, it would be nice to have a fast implementation of set-theoretic operations for dataframes and series, which would be particularly useful when dealing with millions of rows.
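One possible way to get set-theoretic operations on whole rows, sketched under the assumption that `pd.MultiIndex.from_frame` is available (it is not mentioned anywhere in this thread): treat each row as one tuple in a `MultiIndex` and use the index set methods. Performance on millions of rows has not been measured here, and the sample frames are made up.

```python
import pandas as pd

df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [4, 2, 5], "col2": [6, 3, 5]})

# Each row becomes one tuple-like entry of a MultiIndex.
rows1 = pd.MultiIndex.from_frame(df1)
rows2 = pd.MultiIndex.from_frame(df2)

# Index set operations act on whole rows; duplicates are dropped.
only_in_df1 = rows1.difference(rows2).to_frame(index=False)
in_both = rows1.intersection(rows2).to_frame(index=False)
in_either = rows1.union(rows2).to_frame(index=False)

print(only_in_df1)
```

These are true set semantics: repeated rows collapse to one and the original row order and index are not preserved, which may or may not be what is wanted.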

Contributor

jreback commented Sep 27, 2013

@hayd isin covers this yes?

Contributor

hayd commented Sep 27, 2013

Once DataFrame.isin works with a DataFrame argument, then yes.

Member

cpcloud commented Sep 27, 2013

hm is that just not tested? df.isin(df2) doesn't raise

Contributor

hayd commented Sep 27, 2013

@cpcloud it "works" but doesn't do quite what you hope. Let me have a think about this tomorrow afternoon, should be able to fix rather than NotImplement.

Contributor

jreback commented Sep 30, 2013

closing in favor of #4480

hayd closed this as completed May 29, 2014