Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

setdiff!(a,b) slower than setdiff(a,b) when b is large (and viceversa when a is large) #25317

Closed
goretkin opened this issue Dec 29, 2017 · 9 comments · Fixed by #29048
Closed
Labels
collections Data structures holding multiple items, e.g. sets good first issue Indicates a good issue for first-time contributors to Julia performance Must go faster

Comments

@goretkin
Copy link
Contributor

julia> s = Set(rand(1_000_000));

julia> s1 = Set([1.0])
Set([1.0])

julia> @time setdiff!(s1, s)
  0.049974 seconds (19.83 k allocations: 1.146 MiB)
Set([1.0])

julia> @time setdiff!(s1, s)
  0.023034 seconds (4 allocations: 160 bytes)
Set([1.0])

julia> s1
Set([1.0])

julia> @time setdiff(s1, s)
  0.012756 seconds (5.46 k allocations: 325.882 KiB)
Set([1.0])

julia> @time setdiff(s1, s)
  0.000005 seconds (9 allocations: 656 bytes)
Set([1.0])

setdiff! iterates through all of b. setdiff iterates through all of a. Both functions could be specialized for b::Set.

On
Version 0.7.0-DEV.3096
Commit 3216445* (10 days old master)

@rfourquet
Copy link
Member

If I'm not mistaken, the recently merged #23528 have made setdiff as slow as setdiff! 😅
The reason is that it's convenient to implement setdiff in terms of setdiff!. But as you note, the performance are inversed when you swap a and b, so indeed it can be specialized when they are both sets, but there has to be some tuning, to find a good threshold deciding whether you iterate a or b. Maybe choosing the smaller one is a good enough start.

@rfourquet rfourquet added collections Data structures holding multiple items, e.g. sets good first issue Indicates a good issue for first-time contributors to Julia performance Must go faster labels Dec 29, 2017
@techytoes
Copy link

techytoes commented Jan 2, 2018

@rfourquet is this issue still open? I really want to take this issue.I'm new to this organisation.

@nalimilan
Copy link
Member

Yes, it's still open and help would certainly be welcome.

@techytoes
Copy link

Okay! I want to work on this issue.Can you please guide me with the required files and changes?

@rfourquet
Copy link
Member

Can you please guide me with the required files and changes?

For the files, you can use the @which macro, e.g. @which setdiff(Set([1]), Set([2])). For the required changes, you need to familiarize yourself with how the functions work now, and I guess introduce two branches depending on the size of the arguments.

@guilhermeleobas
Copy link
Contributor

Hi @firedranzer, are you still working on this issue? If not, can I start work on it?

@KristofferC
Copy link
Member

Go for it.

As a note, it is for some reason extremely rare for people who ask in issues if they can work on something to actually come through and make a PR. Be the exception :)

@laborg
Copy link
Contributor

laborg commented Sep 11, 2018

I see similar asymmetric problems with symdiff, intersect and union. It might be worthwhile to invest in a more generic solution for this problem.

@StefanKarpinski
Copy link
Member

@laborg: your work on this so far on this is great—it would be amazing if you wanted to continue to work away at this and make all of these functions cleverer and more efficient!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
collections Data structures holding multiple items, e.g. sets good first issue Indicates a good issue for first-time contributors to Julia performance Must go faster
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants