Set difference in Pandas #4617

ghost · 2013-08-20T16:49:05Z

I posted this question, "set difference for pandas", on SO, but the answer I received produces some loss of precision.

The basic idea is that if you have two dataframes df1, df2, you should be able to get the rows in df1 that are not included in df2. That is, the set difference of df1 and df2.
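For readers landing here later, the row-wise set difference described above can be sketched with a left merge and the `indicator` flag. This is not the approach proposed in this thread. It is a minimal sketch that assumes both frames share the same columns, and the sample frames and the helper name `row_difference` are made up for illustration.

```python
import pandas as pd

# Illustrative data (assumed, not taken from the thread).
df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [4, 2, 5], "col2": [6, 3, 5]})


def row_difference(left, right):
    """Return the rows of `left` that do not appear as whole rows in `right`.

    Hypothetical helper: merges on all shared columns and keeps the rows
    that were found only in `left`.
    """
    # drop_duplicates() keeps the merge from repeating rows of `left`.
    merged = left.merge(right.drop_duplicates(), how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only", left.columns]


in_df1_only = row_difference(df1, df2)
in_df2_only = row_difference(df2, df1)
```

The merge compares whole rows, so a row is dropped only when every column matches a single row of the other frame.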

I think this would be a good feature request in case it's not possible to solve this easily using pandas' current capabilities.

hayd · 2013-08-20T17:27:02Z

related issue re Series object set operations #4480

jreback · 2013-08-20T18:11:35Z

@rsmith31415 maybe your example was a bit unclear, but the 2 solutions look reasonable (and as I said, isin might be a nicer solution); what do you mean by loss of precision?

ghost · 2013-08-20T19:07:40Z

@jreback When you try Joop's solution on a dataframe containing floating point numbers with more than 5 decimal places, there is a rounding error associated with the set operation.

Jeff's answer doesn't produce the desired result.

Not sure how and when we can use isin. Could you elaborate on that?

jreback · 2013-08-20T19:12:17Z

Maybe you should give a detailed floating point input and desired output. It is not clear what you are after.

hayd · 2013-08-20T20:16:31Z

You nearly want to do:

In [8]: df2[~df2.isin(df1.to_dict(outtype='list')).all(1)]
Out[8]: 
   col1  col2
0     4     6
2     5     5

In [9]: df2[~df2.isin(df1).all(1)]  # soon

But the problem is this just checks whether each element is in each column, not that they appear together. The problem being that these act column-wise. Hmmm...
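To make the column-wise caveat concrete, here is a small sketch comparing the per-column `isin` check with a whole-row check. The frames are invented for illustration, with the row `(1, 3)` chosen so that each value occurs in the matching column of `df1` but never together in one row. The sketch uses `to_dict(orient="list")`, the keyword current pandas uses where the snippet above has `outtype`. The tuple-based row check is just one possible workaround, not something pandas provides as a set operation.

```python
import pandas as pd

# Illustrative data (assumed): (1, 3) is not a row of df1, although
# 1 occurs in df1.col1 and 3 occurs in df1.col2.
df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [4, 2, 1], "col2": [6, 3, 3]})

# Column-wise check: each value is tested against its own column only.
col_wise_mask = df2.isin(df1.to_dict(orient="list")).all(axis=1)
col_wise = df2[~col_wise_mask]

# Row-wise check: compare whole rows as tuples.
df1_rows = set(df1.itertuples(index=False, name=None))
row_mask = [row not in df1_rows
            for row in df2.itertuples(index=False, name=None)]
row_wise = df2[row_mask]
```

Here `col_wise` misses the row `(1, 3)` because both of its values pass the per-column test, while `row_wise` keeps it.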

ghost · 2013-08-20T20:26:33Z

Now that I try to reproduce this issue, I suspect this might be related to the precision of dataframes. For example:

df1 = pd.DataFrame({'col1':[1.5415541549E12,2.144815145649E12,3.1541515119E12], 'col2':[2.154456546519E12,3.1456165165419E12,4.1456165165419E12]})

produces:

0 1.541554e+12 2.154457e+12
1 2.144815e+12 3.145617e+12
2 3.154152e+12 4.145617e+12

Is there any way to increase the precision?

cpcloud · 2013-08-20T20:27:52Z

that's just the way they're printed....try e.g., pd.set_option('display.precision', 12)

cpcloud · 2013-08-20T20:29:07Z

pandas would have failed a long time ago if it truncated an arbitrary number of decimals simply by constructing a DataFrame!

cpcloud · 2013-08-20T20:31:52Z

You can also look at individual values to confirm that they are in fact the same as when you put them in. E.g.,

In [8]: df1
Out[8]:
            col1               col2
0  1541554154900  2.15445654652e+12
1  2144815145649  3.14561651654e+12
2  3154151511900  4.14561651654e+12

In [9]: df1.col2[2]
Out[9]: 4145616516541.8999

In [10]: test_value = 4.1456165165419E12

In [12]: df1.col2[2] == test_value
Out[12]: True
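The same check can be written as a short script. This sketch assumes a pandas version that provides `pd.option_context`, and uses it so the wider display precision applies only inside the `with` block rather than globally.

```python
import pandas as pd

# The frame from the comment above.
df1 = pd.DataFrame({
    "col1": [1.5415541549e12, 2.144815145649e12, 3.1541515119e12],
    "col2": [2.154456546519e12, 3.1456165165419e12, 4.1456165165419e12],
})

# Only the printed representation depends on display.precision.
default_text = repr(df1)
with pd.option_context("display.precision", 12):
    wide_text = repr(df1)

# The stored float is exactly the value that was put in.
stored = df1["col2"].iloc[2]
test_value = 4.1456165165419e12
values_match = stored == test_value
```

`values_match` is true regardless of the display setting, since the option affects formatting only.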

ghost · 2013-08-20T20:32:07Z

@hayd

"But the problem is this just checks whether each element is in each column, not that they appear together. The problem being that these act column-wise. Hmmm..."

Exactly right.

Given @cpcloud's comment, Joop's solution should work well. Nevertheless, it would be nice to have a fast implementation of set-theoretic operations for dataframes and series, which would be particularly useful when dealing with millions of rows.

jreback · 2013-09-27T02:14:46Z

@hayd isin covers this yes?

hayd · 2013-09-27T02:36:59Z

Once DataFrame's isin works with a DataFrame then yes.

cpcloud · 2013-09-27T02:41:34Z

hm is that just not tested? df.isin(df2) doesn't raise

hayd · 2013-09-27T04:06:40Z

@cpcloud it "works" but doesn't do quite what you hope. Let me have a think about this tomorrow afternoon, should be able to fix rather than NotImplement.

jreback · 2013-09-30T13:40:27Z

closing in favor of #4480

hayd mentioned this issue Aug 21, 2013

ENH/API: DataFrame's isin should accept DataFrames #4421

Closed

hayd closed this as completed May 29, 2014
