Add set operations to Series objects #4480

cpcloud · 2013-08-06T14:57:15Z

from this SO question

hayd · 2013-08-06T15:49:46Z

Would this just be on the Series values (i.e. ignore index)?

I had a look in algos and couldn't see anything (np intersect1d seems slow).

OT but a weird thing from that Q is that you can't call Series on a set.

cpcloud · 2013-08-06T15:55:08Z

yet a frozenset works....weird...i would've thought that frozenset was a subclass of set, guess not

cpcloud · 2013-08-06T15:55:33Z

well it doesn't work...it returns a frozenset...

hayd · 2013-08-06T15:57:37Z

(I don't see why we enforce this, list is happy to take a set, why shouldn't Series?)

cpcloud · 2013-08-06T15:58:19Z

because there's no way to map indices to set, they are arbitrary since a set object is unordered

hayd · 2013-08-06T15:58:39Z

ha! Series lets a lot of stuff drop through.... eg. Series(1)

cpcloud · 2013-08-06T15:58:53Z

but you're right....list does it so there must be some arbitrary indices assigned

cpcloud · 2013-08-06T16:00:33Z

i just discovered that Series takes generators! i had no idea.

hayd · 2013-08-06T16:06:06Z

and that's the workaround, right? pass it to list first... kinda sucks. Also, isn't a dict similarly unordered (and yet we allow that) :s

Interestingly I thought we used np.fromiter to do that, but apparently it's just list.
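A minimal sketch of the workaround discussed above, assuming a recent pandas: materialize the set as a list before handing it to Series. Sorting is my own choice here to make the order deterministic; the thread only says "pass it to list first". The dict case is shown for contrast, since its keys become the index.

```python
import pandas as pd

values = {3, 1, 2}

# A set has no defined order, so pick one explicitly before building the Series.
s_from_set = pd.Series(sorted(values))

# A dict needs no such step: its keys are used as the index labels.
s_from_dict = pd.Series({"a": 1, "b": 2})

print(s_from_set.tolist())
print(s_from_dict.index.tolist())
```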

cpcloud · 2013-08-06T16:12:07Z

but dict has keys which are used as the index

cpcloud · 2013-08-06T16:13:18Z

i guess it's a workaround, but how else would you do it?

cpcloud · 2013-08-06T16:14:57Z

i made the change....nothing breaks...i'll submit

cpcloud · 2013-08-06T16:15:08Z

for set construction that is

cpcloud · 2013-08-06T16:16:03Z

#4482

jreback · 2013-08-06T16:18:59Z

fyi...the Series(1) stuff is getting squashed in #3862 (by #3482), it's just odd

hayd · 2013-08-06T16:19:49Z

Cool bananas

cpcloud · 2013-08-07T15:59:48Z

won't be adding this, so closing

ghost · 2013-08-08T03:58:39Z

@cpcloud, just making sure your closing this is unrelated to the discussion in #4482,
which was only about modifying the series ctor to accept sets, not about adding
set operations to series.

cpcloud · 2013-08-08T04:06:40Z

you're right...that pr was just to disallow frozenset and was only related to the ctor...reopening

jreback · 2013-09-27T02:14:05Z

@cpcloud what's the status on this?

cpcloud · 2013-09-27T02:17:45Z

gone by wayside .... i don't really have time to implement this ... but i think we should leave it open ... marking as someday

jreback · 2013-09-27T02:18:39Z

ok..gr8 thxs

jreback · 2013-09-27T02:19:33Z

most of these I just push to 0.14.....someday is a box very rarely opened :)

makmanalp · 2014-06-09T17:17:19Z

Hmm, is this closed now since 0.14 is out?

jreback · 2014-06-09T17:32:38Z

@makmanalp what are you trying to do?

makmanalp · 2014-06-09T17:35:42Z

Efficiently calculate which of the rows of a column in df1 also exist in another column in df2 (or perhaps indices instead of columns).

jreback · 2014-06-09T17:36:56Z

this issue is a bit different than that (see the linked question), did you try isin?
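A small sketch of the isin suggestion for the use case described just above. The frame and column names (df1, df2, key, other_key) are made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"key": [1, 2, 3, 4]})
df2 = pd.DataFrame({"other_key": [3, 4, 5]})

# Boolean mask: True where the value in df1["key"] also occurs in df2["other_key"].
mask = df1["key"].isin(df2["other_key"])

# Keep only the rows of df1 whose key exists in df2.
matching_rows = df1[mask]
print(matching_rows)
```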

hayd · 2014-06-09T18:28:29Z

In the original question it looks like the OP wants to ignore the index... in which case they can use the set operations in Index:

pd.Index(s0.values) & pd.Index(s1.values)

hayd · 2014-06-09T18:31:11Z

Wow, that's really slow, I take that back...

jreback · 2014-06-09T18:32:22Z

Index keeps things ordered; shouldn't do it that way. Better to drop into numpy or python, do the set operation and reconstruct.
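One way to read "drop into numpy or python, do the set operation and reconstruct", as a hedged sketch. The example data is invented, and both variants ignore the original index, which matches the "ignore index" reading of the original question.

```python
import numpy as np
import pandas as pd

s0 = pd.Series([1, 2, 3, 4])
s1 = pd.Series([3, 4, 5, 6])

# NumPy route: both functions return sorted, de-duplicated arrays.
common = pd.Series(np.intersect1d(s0.to_numpy(), s1.to_numpy()))
in_exactly_one = pd.Series(np.setxor1d(s0.to_numpy(), s1.to_numpy()))

# Plain-Python route: build sets, operate, then rebuild a Series.
common_py = pd.Series(sorted(set(s0) & set(s1)))

print(common.tolist(), in_exactly_one.tolist(), common_py.tolist())
```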

milindsmart · 2017-04-21T09:55:26Z

Any update on this? Is this still impossible? I was looking at doing a symmetric difference.

h-vetinari · 2017-12-06T13:35:36Z

This is something I would have needed several times in the last half year. Using apply is just terribly slow for large data sets (and numpy has fast set implementations like np.intersect1d) - for example, I have code where apply+intersect is 98% of running time.

To recap (since there's a lot of tangential discussion in this thread), I think there is a good case to be made for a .set-accessor, providing access to universal functions operating on sets, like there currently is for .str, .dt. As an example:

import pandas as pd # 0.21.0
import numpy as np # 1.13.3

# this function is just for demonstration purposes
def random_sets(n = 100):
    length = 10
    # strings of random numbers, padded to common length
    s = pd.Series(np.random.randint(0, 10**length-1, (n,), dtype = np.int64)).astype(str).str.zfill(length)
    # split into set of individual numbers and cast back to integers
    return s.map(set).apply(lambda s: set(int(x) for x in s)) 

a = random_sets(5)
# WANTED: a.set.intersect(set([1, 2])) should result in:
a.apply(lambda r: r & set([1, 2])) # intersection of 'a' with {1, 2}, e. g.
# 0    {1, 2}
# 1       {1}
# 2        {}
# 3    {1, 2}
# 4        {}

b = random_sets(5)
# WANTED: a.set.intersect(b) should result in:
pd.concat([a, b], keys = ['a', 'b'], axis = 1).apply(lambda row: row['a'] & row['b'], axis = 1) # intersection of 'a' and 'b' (per row!), e. g.
# 0          {8, 1, 2}
# 1    {1, 3, 5, 7, 9}
# 2       {0, 8, 5, 9}
# 3       {8, 2, 3, 6}
# 4       {8, 9, 3, 6}

Like the .str-methods, it should work with either another Series (including index alignment), or broadcast a 'scalar' set correspondingly. Some important methods I think should be implemented (the function signature tries to indicate the action per row; the names are just suggestions):

(set, set) -> set:
a.set.intersect(b) - as above
a.set.union(b) - row-wise union of a and b
a.set.diff(b) - row-wise set difference of a and b
a.set.xor(b) - row-wise symmetric difference of a and b

(set, set) -> bool:
a.set.subset(b) - row-wise check if a is subset of b
a.set.superset(b) - row-wise check if a is superset of b

(set, obj) -> bool:
a.set.contains(c) - row-wise check if a contains c
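The proposal above could be prototyped without touching pandas itself, using the public accessor registration hook. This is only a sketch: the accessor name "sets" and the class and method names are hypothetical, only three of the listed methods are shown, and the loop is plain Python, so it offers the API but none of the speed the proposal hopes for.

```python
import pandas as pd


@pd.api.extensions.register_series_accessor("sets")  # hypothetical accessor name
class SetAccessor:
    def __init__(self, series):
        self._s = series

    def _binary(self, other, op, dtype=object):
        # Another Series: align on the index first, then combine row by row.
        if isinstance(other, pd.Series):
            left, right = self._s.align(other, join="inner")
            data = [op(x, y) for x, y in zip(left, right)]
            return pd.Series(data, index=left.index, dtype=dtype)
        # Anything else is treated as a single set broadcast to every row.
        data = [op(x, other) for x in self._s]
        return pd.Series(data, index=self._s.index, dtype=dtype)

    def intersect(self, other):
        return self._binary(other, lambda x, y: x & y)

    def union(self, other):
        return self._binary(other, lambda x, y: x | y)

    def subset(self, other):
        return self._binary(other, lambda x, y: x <= y, dtype=bool)


a = pd.Series([{1, 2}, {2, 4}])
b = pd.Series([{2, 3}, {1, 3}])

row_wise = a.sets.intersect(b)         # Series vs. Series, per row
broadcast = a.sets.intersect({2, 5})   # Series vs. one set
is_subset = a.sets.subset({1, 2, 3})
```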

jreback · 2017-12-06T13:40:56Z

@h-vetinari sets are not efficiently stored, so this offers only an API benefit, for which I have yet to see an interesting use case. You can use Index for these operations individually and that is quite efficient. Series.isin is pretty much .contains. IntervalIndex will work for some of these cases as well.

h-vetinari · 2017-12-06T13:54:56Z

@jreback, well, I was hoping not just for an API improvement, but some fast cython code to back it up (like for the .str-methods). ;-)

Do I understand you correctly that you propose to work with Series of (short) pd.Indexes? How would you then do something like a.set.intersect(b) (as described above)?

jreback · 2017-12-06T14:25:35Z

@h-vetinari and you are welcome to contribute things. The current impl would be quite inefficient and there is no easy way to get around this ATM.

chinchillaLiao · 2018-05-30T07:47:40Z

import pandas as pd
df = pd.DataFrame({'a':{1,2,3}, 'b':{2,3,4}})

Difference operator works between Series.

df['a - b'] = df['a'] - df['b']

Set intersection operator doesn't work between Series.

df['a & b'] = df['a'] & df['b']

A very slow way to do intersection between Series:

df['a & b'] = df.apply(lambda row: row['a'] & row['b'], axis = 1)

I found it is much faster to do intersection this way:

df['a & b'] = df['a'] - (df['a'] - df['b'])

I don't know why.
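On why the two expressions give the same answer (this says nothing about the timing): removing from a everything that is not in b leaves exactly the elements common to both, so a - (a - b) equals a & b. A quick check with plain Python sets, using made-up values:

```python
a = {1, 2, 3}
b = {2, 3, 4}

# a - b keeps what is only in a; subtracting that from a leaves the overlap.
only_in_a = a - b
via_difference = a - only_in_a

print(via_difference == (a & b))
```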

h-vetinari · 2018-06-05T19:54:27Z

@chinchillaLiao : cool, didn't know set difference worked on Series! It's the only one to work on pandas level though.

But an even better work-around is to go down to the numpy-implementation with .values. In particular, this shouldn't suffer from the speed degradation you're reporting.

(@jreback; my comment half a year ago about a .set-accessor now seems very superfluous - why not just enable the numpy-behaviour directly in pandas?)

df = pd.DataFrame([[{1,2}, {2,3}],[{2,4}, {3, 1}]], columns=['A', 'B'])
df
#         A       B
# 0  {1, 2}  {2, 3}
# 1  {2, 4}  {1, 3}

df['A - B'] = df.A - df.B  # only one that works out of the box in pandas
df['A - B']
# 0       {1}
# 1    {2, 4}
# dtype: object

df['A & B'] = df.A & df.B
# TypeError: unsupported operand type(s) for &: 'set' and 'bool'
df['A & B'] = df.A.values & df.B.values
df['A & B'] 
# 0    {2}
# 1     {}
# Name: A & B, dtype: object

df['A | B'] = df.A | df.B
# TypeError: unsupported operand type(s) for |: 'set' and 'bool'
df['A | B'] = df.A.values | df.B.values
df['A | B'] 
# 0       {1, 2, 3}
# 1    {1, 2, 3, 4}
# Name: A | B, dtype: object

df['A ^ B'] = df.A ^ df.B
# TypeError: unsupported operand type(s) for ^: 'set' and 'bool'
df['A ^ B'] = df.A.values ^ df.B.values
df['A ^ B'] 
# 0          {1, 3}
# 1    {1, 2, 3, 4}
# Name: A ^ B, dtype: object

df
#         A       B   A - B A & B         A | B         A ^ B
# 0  {1, 2}  {2, 3}     {1}   {2}     {1, 2, 3}        {1, 3}
# 1  {2, 4}  {1, 3}  {2, 4}    {}  {1, 2, 3, 4}  {1, 2, 3, 4}

In terms of usability, the really cool thing is that this also works for many-to-one comparisons.

dd = df.A.to_frame()
C = {2, 5}
dd['A - C'] = df.A - C
dd['A & C'] = df.A.values & C
dd['A | C'] = df.A.values | C
dd['A ^ C'] = df.A.values ^ C

dd
#         A A - C A & C      A | C   A ^ C
# 0  {1, 2}   {1}   {2}  {1, 2, 5}  {1, 5}
# 1  {2, 4}   {4}   {2}  {2, 4, 5}  {4, 5}

jreback · 2018-06-05T21:07:51Z

sets are not first class and actually completely inefficient in a Series

h-vetinari · 2018-06-05T21:13:29Z

Inefficient as opposed to what? Some situations fundamentally require processing sets.

And even so, why make treating sets harder than it needs to be? I used to think (see my response from December) that this wasn't implemented at all, but since it's in numpy already, why not just expose that functionality on a pandas level? Sure as hell beats writing your own .apply() loops, both in terms of speed and code complexity.

jreback · 2018-06-05T21:38:14Z

complexity in terms of implementation and code

sure if u wanted to contribute would be great

but it’s not trivial to do in a first class supported way

hayd · 2018-06-06T02:37:26Z

@h-vetinari what is your use case for this? How does this come up?

IMO a nice way to contribute this would be with an extension type in a library. Depending on your use case. If you have a smallish finite super-set you can describe each set as a bitarray (and hence do set operations cheaply).
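The bitarray idea can be sketched roughly as follows, assuming the universe of possible elements is small and known up front (at most 64 values here, so each set fits in one unsigned integer). The helper names encode and decode and the sample data are my own.

```python
import numpy as np

universe = [1, 2, 3, 4, 5]  # assumed small, fixed superset
bit_of = {value: i for i, value in enumerate(universe)}


def encode(s):
    """Turn a set into an integer with one bit per universe element."""
    mask = 0
    for value in s:
        mask |= 1 << bit_of[value]
    return mask


def decode(mask):
    """Turn an integer bitmask back into a set."""
    mask = int(mask)
    return {value for value, i in bit_of.items() if (mask >> i) & 1}


a_bits = np.array([encode(s) for s in [{1, 2}, {2, 4}]], dtype=np.uint64)
b_bits = np.array([encode(s) for s in [{2, 3}, {1, 3}]], dtype=np.uint64)

# Set operations become vectorized bitwise operations on integer arrays.
intersection = [decode(m) for m in a_bits & b_bits]
union = [decode(m) for m in a_bits | b_bits]
```

The trade-off is that every element has to be mapped to a bit position beforehand, so this only fits cases where the superset really is small and stable.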

Note: This is quite different from the original issue: set operations like set(s1) & set(s2)...

h-vetinari · 2018-06-07T06:09:06Z

@jreback

OK, I'm thinking about contributing that. Since the numpy-methods I showed above are actually not nan-safe,

np.array([{1,2}, np.nan]) | np.array([{2,4}, {3, 1}])
# TypeError: unsupported operand type(s) for |: 'float' and 'set'

I'm back to thinking that a set accessor for Series (not for Index) would be the best. And, since I wouldn't have to write the cython for those methods, I think I can come up with such a wrapper relatively easily.
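A rough sketch of what a nan-safe wrapper around these operations might look like. The function name nan_safe_setop is hypothetical, and propagating NaN whenever either side is not a set is just one possible policy.

```python
import operator

import numpy as np


def nan_safe_setop(left, right, op):
    """Apply op pairwise; if either entry is not a set, the result is NaN."""
    out = np.empty(len(left), dtype=object)
    for i, (x, y) in enumerate(zip(left, right)):
        if isinstance(x, (set, frozenset)) and isinstance(y, (set, frozenset)):
            out[i] = op(x, y)
        else:
            out[i] = np.nan
    return out


left = np.array([{1, 2}, np.nan], dtype=object)
right = np.array([{2, 4}, {3, 1}], dtype=object)

result = nan_safe_setop(left, right, operator.or_)
```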

@hayd
Thanks for the link. I've had several use cases over time, e.g. joining email-addresses/telephone numbers when deduplicating user information. But it keeps cropping up. I'm actually really happy I found out about those numpy-methods some days ago. ;-)

Re:

Note: This is quite different from the original issue: set operations like set(s1) & set(s2)...

I've chosen to comment on this issue (rather than opening a new one) due to the title, which imo has a much larger scope. I could easily open a more general issue, if desired.

jbrockmendel · 2023-02-22T22:07:17Z

Discussed on today's dev call and the consensus was to convert your Series to Index and do setops on those. Closing.
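For reference, a short sketch of what that recommendation looks like in practice, with invented data. It uses the named Index methods rather than the & operator shown earlier in the thread, since, as far as I know, newer pandas versions no longer treat operators on Index as set operations.

```python
import pandas as pd

s0 = pd.Series([1, 2, 3, 4])
s1 = pd.Series([3, 4, 5, 6])

# Convert the values to Index objects, which carry the set operations.
i0, i1 = pd.Index(s0), pd.Index(s1)

common = pd.Series(i0.intersection(i1))
combined = pd.Series(i0.union(i1))
only_in_s0 = pd.Series(i0.difference(i1))
in_exactly_one = pd.Series(i0.symmetric_difference(i1))
```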

cpcloud closed this as completed Aug 7, 2013

cpcloud reopened this Aug 8, 2013

hayd mentioned this issue Aug 20, 2013

Set difference in Pandas #4617

Closed

jreback modified the milestones: Someday, 0.14.0 Feb 18, 2014

Dr-Irv mentioned this issue Dec 13, 2016

DOC: pandas cheat sheet #13202

Closed

h-vetinari mentioned this issue Jun 19, 2018

ENH: set accessor for Series (WIP) #21547

Closed

h-vetinari mentioned this issue Aug 16, 2018

WIP: EA SetArray #22382

Closed

mroeschke removed the API Design label Apr 11, 2021

jbrockmendel added the setops union, intersection, difference, symmetric_difference label Jun 17, 2021

AlexKirko mentioned this issue Jun 22, 2021

ENH:Create Set Operations #42177

Open

mroeschke removed this from the Someday milestone Oct 13, 2022

jbrockmendel closed this as completed Feb 22, 2023
