Set difference in Pandas #4617
Comments
Related issue on set operations for Series objects: #4480
@rsmith31415 maybe your example was a bit unclear, but the 2 solutions look reasonable (and as I said,
@jreback When you try Joop's solution on a dataframe containing floating-point numbers with more than 5 decimal places, there is a rounding error associated with the set operation. Jeff's answer doesn't produce the desired result. Not sure how and when we can use
Maybe you should give a detailed floating-point input and desired output. It is not clear what you are after.
You nearly want to do:
But the problem is that this just checks whether each element appears in the corresponding column, not whether the elements appear together in the same row, because these operations act column-wise. Hmmm...
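To illustrate the column-wise behaviour described above, here is a minimal sketch. The frames `df1` and `df2` are made-up examples, not taken from this thread, and the row-wise check via tuples is just one possible approach, assuming a reasonably recent pandas.

```python
import pandas as pd

# Hypothetical example frames (not from the thread)
df1 = pd.DataFrame({"col1": [1, 1], "col2": [2, 3]})
df2 = pd.DataFrame({"col1": [1, 5], "col2": [9, 3]})

# Column-wise check: each value is tested against the same-named column of df2.
# Row (1, 3) passes because 1 is in df2.col1 and 3 is in df2.col2,
# even though (1, 3) is not a row of df2.
columnwise = df1.isin(df2.to_dict("list")).all(axis=1)

# Row-wise check: compare whole rows as tuples.
rows_in_df2 = set(df2.itertuples(index=False, name=None))
rowwise = [row in rows_in_df2 for row in df1.itertuples(index=False, name=None)]

print(columnwise.tolist())  # [False, True]
print(rowwise)              # [False, False]
```

The second row is a false positive for the column-wise check, which is why `isin` alone does not give a row-level set difference.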
Now that I try to reproduce this issue, I suspect this might be related to the precision of dataframes. For example: df1 = pd.DataFrame({'col1':[1.5415541549E12,2.144815145649E12,3.1541515119E12], 'col2':[2.154456546519E12,3.1456165165419E12,4.1456165165419E12]}) produces:
Is there any way to increase the precision?
That's just the way they're printed... try e.g.,
pandas would have failed a long time ago if it truncated an arbitrary number of decimals simply by constructing a
You can also look at individual values to confirm that they are in fact the same as when you put them in. E.g.,
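A sketch of such a check, using the `df1` from the comment above. This is not the commenter's original snippet, and the `display.precision` setting is only one way to show more digits when printing.

```python
import pandas as pd

df1 = pd.DataFrame({
    "col1": [1.5415541549E12, 2.144815145649E12, 3.1541515119E12],
    "col2": [2.154456546519E12, 3.1456165165419E12, 4.1456165165419E12],
})

# The stored float64 values are the ones that were put in;
# only the printed representation is shortened.
first_value = df1["col1"].iloc[0]
unchanged = first_value == 1.5415541549E12

# Temporarily raise the display precision to see more digits when printing.
with pd.option_context("display.precision", 15):
    print(df1)
```

Comparing individual values like this confirms that constructing the DataFrame does not round the data.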
Exactly right. Given @cpcloud's comment, Joon's solution should work well. Nevertheless, it would be nice to have a fast implementation of set-theoretic operations for dataframes and series, which would be particularly useful when dealing with millions of rows.
@hayd isin covers this, yes?
Once DataFrame's isin works with a DataFrame, then yes.
hm is that just not tested?
@cpcloud it "works" but doesn't do quite what you hope. Let me have a think about this tomorrow afternoon, should be able to fix rather than NotImplement.
closing in favor of #4480
I posted this question, "set difference for pandas", on SO, but the answer I received produces some loss of precision.
The basic idea is that if you have two dataframes df1, df2, you should be able to get the rows in df1 that are not included in df2. That is, the set difference of df1 and df2.
I think this would be a good feature request in case it's not possible to solve this easily using pandas' current capabilities.
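For readers landing here, one way to sketch the requested set difference in current pandas versions is a left merge with `indicator=True`. The helper name `dataframe_difference` and the example frames are hypothetical, and this is not the approach proposed in the thread. It assumes both frames share the same columns and that exact equality of values is the intended comparison.

```python
import pandas as pd

def dataframe_difference(left, right):
    """Return rows of `left` that do not appear in `right` (hypothetical helper).

    Rows are compared on all columns the two frames have in common.
    """
    merged = left.drop_duplicates().merge(
        right.drop_duplicates(), how="left", indicator=True
    )
    # "_merge" marks each row as "left_only" or "both" after a left merge
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

# Made-up example data
df1 = pd.DataFrame({"col1": [1, 2, 3], "col2": [2, 3, 4]})
df2 = pd.DataFrame({"col1": [1, 2], "col2": [2, 3]})

result = dataframe_difference(df1, df2)
print(result.values.tolist())  # [[3, 4]]
```

Because the merge compares values for exact equality, floating-point columns that differ in the last digits are treated as different rows.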