Add set operations to Series objects #4480
Comments
Would this just be on the Series values (i.e. ignore index)? I had a look in algos and couldn't see anything (np intersect1d seems slow). OT but a weird thing from that Q is that you can't call Series on a set. |
yet a |
well it doesn't work...it returns a |
(I don't see why we enforce this, list is happy to take a set, why shouldn't Series?) |
because there's no way to map indices to set, they are arbitrary since a set object is unordered |
ha! Series lets a lot of stuff drop through.... eg. |
but you're right....list does it so there must be some arbitrary indices assigned |
i just discovered that |
and that's the workaround, right? pass it to list first... kinda sucks Also isn't a dict similarly unordered (and yet we allow that) :s Interestingly I thought we used np.fromiter to do that, but apparently it's just list. |
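A minimal sketch of the "pass it to list first" workaround mentioned here. The use of `sorted` (rather than plain `list`) is an assumption added only to make the resulting order deterministic; the variable names are illustrative.

```python
import pandas as pd

values = {3, 1, 2}

# A set has no defined order, so materialise it as a list before
# handing it to Series; sorted() pins down the order explicitly.
s = pd.Series(sorted(values))
```

The index is then just the default 0..n-1 range, which is as arbitrary as the order chosen for the list.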
but |
i guess it's a workaround, but how else would you do it? |
i made the change....nothing breaks...i'll submit |
for |
Cool bananas |
won't be adding this, so closing |
you're right...that pr was just to disallow |
@cpcloud what's the status on this? |
gone by wayside .... i don't really have time to implement this ... but i think we should leave it open ... marking as someday |
ok..gr8 thxs |
most of these I just push to 0.14.....someday is a box very rarely opened :) |
Hmm, is this closed now since 0.14 is out? |
@makmanalp what are you trying to do? |
Efficiently calculate which of the rows of a column in df1 also exist in another column in df2 (or perhaps indices instead of columns). |
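For a plain membership test like this, one option is `Series.isin`. This is a sketch under the assumption that only "does this value appear in the other column" matters; the frame and column names are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({"key": [1, 2, 3, 4]})
df2 = pd.DataFrame({"other": [2, 4, 6]})

# Boolean mask: True where a value of df1["key"] also occurs in df2["other"]
mask = df1["key"].isin(df2["other"])

# Rows of df1 whose key exists in df2
matches = df1[mask]
```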
this issue is a bit different than that (see the linked question), did you try |
In the original question it looks like the OP wants to ignore the index... in which case they can use the set operations in Index:
|
Wow, that's really slow, I take that back... |
Index keeps things ordered; shouldn't do it that way, better to drop into numpy or python, do the set operation and reconstruct. |
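A rough sketch of the "drop into numpy or python, then reconstruct" approach, assuming the index can be ignored and duplicates discarded. The example data is made up.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([3, 1, 2, 2])
s2 = pd.Series([2, 3, 5])

# NumPy route: intersect1d returns the sorted unique common values
common_np = pd.Series(np.intersect1d(s1.to_numpy(), s2.to_numpy()))

# Pure-Python route: build sets, intersect, then rebuild a Series
common_py = pd.Series(sorted(set(s1) & set(s2)))
```

Both routes return a fresh Series with a default index; the original index and any duplicate values are lost.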
Any update on this? Is this still impossible? I was looking at doing a symmetric difference. |
This is something I would have needed several times in the last half year. Using To recap (since there's a lot of tangential discussion in this thread), I think there is a good case to be made for a
Like the (set, set) -> set: (set, set) -> bool: (set, obj) -> bool: |
@h-vetinari sets are not efficiently stored, so this offers only an API benefit, for which I have yet to see an interesting use case. You can use |
@jreback, well, I was hoping not just for an API improvement, but some fast cython code to back it up (like for the Do I understand you correctly that you propose to work with |
@h-vetinari and you are welcome to contribute things. I think the current impl would be quite inefficient and there is no easy way to get around this ATM. |
import pandas as pd

Difference operator works between Series:
df['a - b'] = df['a'] - df['b']

Set intersection operator doesn't work between Series:
df['a & b'] = df['a'] & df['b']

A very slow way to do intersection between Series:
df['a & b'] = df.apply(lambda row: row['a'] & row['b'], axis=1)

I found it is much faster to do intersection this way:
df['a & b'] = df['a'] - (df['a'] - df['b'])

I don't know why. |
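A hedged sketch of another way to intersect two columns of sets row by row, using a plain list comprehension instead of `DataFrame.apply`. The data is invented for illustration, and no timing claim is made here. The last line checks the set identity that the subtraction trick relies on: `a - (a - b)` equals `a & b`.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [{1, 2, 3}, {4, 5}],
    "b": [{2, 3}, {5, 6}],
})

# Row-wise intersection without DataFrame.apply: zip the two columns
# and intersect each pair of Python sets directly.
df["a & b"] = [x & y for x, y in zip(df["a"], df["b"])]

# The identity behind the subtraction trick, checked per row
identity_holds = all((x - (x - y)) == (x & y) for x, y in zip(df["a"], df["b"]))
```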
@chinchillaLiao : cool, didn't know set difference worked on Series! It's the only one to work on pandas level though. But an even better work-around is to go down to the (@jreback; my comment half a year ago about a
In terms of usability, the really cool thing is that this also works for
|
sets are not first class and actually completely inefficient in a Series |
Inefficient as opposed to what? Some situations fundamentally require processing sets. And even so, why make treating sets harder than it needs to be? I used to think (see my response from December) that this wasn't implemented at all, but since it's in |
complexity in terms of implementation and code. sure, if u wanted to contribute that would be great, but it’s not trivial to do in a first class supported way |
@h-vetinari what is your use case for this? How does this come up? IMO a nice way to contribute this would be with an extension type in a library. Depending on your use case. If you have a smallish finite super-set you can describe each set as a bitarray (and hence do set operations cheaply). Note: This is quite different from the original issue: set operations like |
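To illustrate the bitarray idea, here is a minimal sketch that encodes each set as an integer bitmask over a small, fixed super-set, so the set operations become vectorized bitwise operations on an integer Series. The `universe`, `encode` and `decode` names are hypothetical helpers, not part of pandas.

```python
import pandas as pd

universe = ["a", "b", "c", "d"]  # hypothetical small finite super-set
bit = {item: 1 << i for i, item in enumerate(universe)}

def encode(s):
    """Turn a set of universe members into an integer bitmask."""
    mask = 0
    for item in s:
        mask |= bit[item]
    return mask

def decode(mask):
    """Turn a bitmask back into a Python set."""
    return {item for item in universe if mask & bit[item]}

left = pd.Series([encode({"a", "b"}), encode({"c"})], dtype="int64")
right = pd.Series([encode({"b", "c"}), encode({"c", "d"})], dtype="int64")

# Bitwise operators on integer Series act element-wise
inter = left & right   # intersection
union = left | right   # union
sym = left ^ right     # symmetric difference

decoded_inter = [decode(m) for m in inter]
decoded_union = [decode(m) for m in union]
decoded_sym = [decode(m) for m in sym]
```

This only works while the super-set fits in the integer width (at most 63 members for a signed `int64` mask); larger universes would need a different encoding.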
OK, I'm thinking about contributing that. Since the numpy-methods I showed above are actually not nan-safe,
I'm back to thinking that a set accessor for Series (not for Index) would be the best. And, since I wouldn't have to write the cython for those methods, I think I can come up with such a wrapper relatively easily. @hayd Re:
I've chosen to comment on this issue (rather than opening a new one) due to the title, which imo has a much larger scope. I could easily open a more general issue, if desired. |
Discussed on today's dev call and the consensus was to convert your Series to Index and do setops on those. Closing. |
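A minimal sketch of what that recommendation can look like in practice, assuming the Series index is irrelevant and duplicates do not matter. The example values are made up.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([3, 4, 5])

# Convert the Series values to Index objects
i1, i2 = pd.Index(s1), pd.Index(s2)

inter = i1.intersection(i2)        # values in both
union = i1.union(i2)               # values in either
diff = i1.difference(i2)           # values only in s1
sym = i1.symmetric_difference(i2)  # values in exactly one of the two

# Wrap a result back into a Series if one is needed
sym_series = pd.Series(sym)
```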
from this SO question