
add tmapreduce #1

Closed
ChrisRackauckas opened this issue Mar 15, 2018 · 11 comments

Comments

@ChrisRackauckas

No description provided.

@bkamins
Collaborator

bkamins commented Mar 18, 2018

I have made a preliminary write-up of map-reduce. I will try to run comprehensive tests next week. The problem with threaded map-reduce is that we have to assume that the values returned by f together with op form an Abelian group (i.e. op is associative and commutative), and in general the results will not be reproducible (e.g. when Float64 addition is considered).
I have also created a special function for addition, which is faster than the general implementation, as I guess this is the most common use case.

There are no such issues with tmap!.
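
For reference, a minimal sketch of the kind of threaded map-reduce being discussed (not the package's actual implementation; the function name is hypothetical). It relies on the Abelian assumption above so that per-thread partial results can be combined in any order:

    using Base.Threads

    # Minimal sketch: split the input into one contiguous chunk per thread,
    # reduce each chunk independently, then combine the partial results.
    # Only valid if `op` is associative and commutative, and `v0` is assumed
    # to be the identity element of `op`; floating-point results may differ
    # from the serial mapreduce.
    function tmapreduce_sketch(f, op, v0, src::AbstractVector)
        isempty(src) && return v0
        chunks = collect(Iterators.partition(eachindex(src), cld(length(src), nthreads())))
        parts = fill(v0, length(chunks))
        @threads for t in 1:length(chunks)
            acc = v0
            for i in chunks[t]
                acc = op(acc, f(src[i]))
            end
            parts[t] = acc
        end
        return reduce(op, parts; init = v0)
    end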

@fisiognomico

Hi!
The work done so far seems very promising. I have been working on writing some additional tests to measure the overhead contribution of trandjump(). I would also like to help you with tmapreduce!(); I will send you something in the next few days.
Checking the commutativity of op over the values returned by f is a very hard problem, and it can be decided only in a few special cases ([link](https://link.springer.com/chapter/10.1007%2F978-3-662-46681-0_9)).
Julia's reduce.jl does not seem to cover the case where the operators are non-Abelian (which can be disregarded as user misbehaviour).
Anyway, I will try to contribute. I hope that this can be turned into a PR soon; in any case, tell me if you need any help!

@bkamins
Collaborator

bkamins commented May 26, 2018

Sure! As you can see in the code, I have tmapreduce and tmapadd (a special case that avoids locks). How would you see tmapreduce! differing? I do not see why the ! is needed: the input data do not get modified here, as opposed to tmap!, which overwrites the dst vector.

The only thing I am waiting for before moving forward is for Julia 0.7 to be out, to be sure I optimize against a stabilized target (especially since, as you probably know, the threading model in Julia is experimental https://discourse.julialang.org/t/future-of-base-threads/10440 and might change in the future).

Anyway, help would be appreciated. As I wrote in the other issue you commented on, I plan to talk about these things at JuliaCon 2018, so it would be great to work out best practices to share with everyone.

@mohamed82008
Owner

mohamed82008 commented Jul 11, 2018

I tried tmapreduce; it had a bug, so I fixed it, but I can't seem to get any speedup over the single-threaded mapreduce. The reason seems to be the closure bug triggered by the @threads macro, since the exact same code is inferred correctly without the outer threads loop.
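
For context, here is a hypothetical minimal example (not the package's code) of the kind of inference problem being described: when a local variable is reassigned inside the closure that @threads builds around the loop body, it gets boxed and the loop loses type stability.

    using Base.Threads

    # `total` is captured and reassigned inside the closure generated by
    # @threads, so it is boxed and inference fails. (It also has a data race;
    # this is only meant to show the inference issue.)
    function sum_boxed(a)
        total = 0.0
        @threads for i in eachindex(a)
            total += log(a[i])
        end
        return total
    end

    # A common workaround: accumulate into per-thread slots so no captured
    # variable is reassigned across the closure boundary.
    function sum_partials(a)
        parts = zeros(nthreads())
        @threads for i in eachindex(a)
            parts[threadid()] += log(a[i])
        end
        return sum(parts)
    end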

@mohamed82008
Owner

mohamed82008 commented Jul 11, 2018

tmapadd is properly inferred but still allocates a lot and is much slower than the unthreaded version. I will share my code in a branch.

@mohamed82008
Owner

mohamed82008 commented Jul 11, 2018

The code can be found in https://github.com/mohamed82008/KissThreading.jl/tree/fixmapreduce which is based on the PR #2.

@mohamed82008
Owner

On the other hand, a complete rework using static load division was made in https://github.com/mohamed82008/KissThreading.jl/tree/rework, and all of the tmap!, tmapreduce and tmapadd functions scale almost linearly with the number of threads.

I noticed that tmap! is still a bit slower than the simple threaded version, i.e. using @threads in front of the loop. And this static load division is also a tad slower than your implementation, which has better load balancing. I think this difference will be even greater if the load per task is not as even as it was in the test cases I used. I think one needs to strike a balance somewhere.
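
As an illustration, a rough sketch of what static load division means here (the function name and chunking are hypothetical, not the branch's actual code): the index range is split into one contiguous chunk per thread up front, so there is no scheduling overhead, at the cost of poor balancing when per-element work varies.

    using Base.Threads

    # Static load division: each thread gets one fixed contiguous chunk.
    # No coordination is needed after the split, but if some elements are much
    # more expensive than others, threads finish at very different times.
    function tmap_static!(f, dst::AbstractVector, src::AbstractVector)
        n = length(src)
        nt = nthreads()
        @threads for t in 1:nt
            lo = div((t - 1) * n, nt) + 1
            hi = div(t * n, nt)
            for i in lo:hi
                dst[i] = f(src[i])
            end
        end
        return dst
    end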

@mohamed82008
Owner

mohamed82008 commented Jul 11, 2018

So dynamic load balancing should be better if it works, but it is causing allocations in tmapreduce and tmapadd for reasons I don't quite get, and it is triggering the closure bug in tmapreduce for some reason.

However, the inference bug is not causing the performance hit, since the static version of tmapreduce also doesn't infer everything but is still faster than the unthreaded version, as it allocates much less. So the allocations are the problem here.

The following timings were generated from:

    using KissThreading   # provides tmapreduce and tmapadd

    n = 10000000
    a = rand(n)

    println("threaded tmapreduce: $(Threads.nthreads()) threads")
    tmapreduce(log, +, 0., a)           # warm-up call to exclude compilation time
    @time tmapreduce(log, +, 0., a)

    println("threaded tmapadd: $(Threads.nthreads()) threads")
    tmapadd(log, 0., a)                 # warm-up
    @time tmapadd(log, 0., a)

    println("unthreaded")
    mapreduce(log, +, a, init=0.)       # warm-up
    @time mapreduce(log, +, a, init=0.)

Static:

threaded tmapreduce: 4 threads
  0.034197 seconds (30 allocations: 1008 bytes)
threaded tmapadd: 4 threads
  0.034678 seconds (16 allocations: 384 bytes)
unthreaded
  0.106056 seconds (18 allocations: 816 bytes)

Dynamic:

threaded tmapreduce: 4 threads
  0.647810 seconds (6.33 M allocations: 93.911 MiB)
threaded tmapadd: 4 threads
  0.636600 seconds (5.87 M allocations: 86.418 MiB)
unthreaded
  0.104249 seconds (18 allocations: 816 bytes)

@mohamed82008
Owner

Relevant: JuliaLang/julia#27694

@bkamins
Collaborator

bkamins commented Jul 11, 2018

Thanks. I was waiting for 0.7 to stabilize to update the package and then register it. In particular, given the new randjump protocol, it is simpler to use an approach that is a modification of https://github.com/nacs-lab/yyc-data/blob/d082032d075070b133fe909c724ecc405e80526a/lib/NaCsCalc/src/utils.jl#L120-L142. I will work on it before JuliaCon 2018, as I am giving a talk using this code then.
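
For readers unfamiliar with the randjump idea, a rough sketch of the approach (a hypothetical helper, not the linked code or the package's implementation), assuming a Julia version where Future.randjump is available: each thread gets its own MersenneTwister whose state has been jumped far ahead of the previous one, so the per-thread streams do not overlap.

    using Random, Future
    using Base.Threads

    # Hypothetical sketch: build one independent RNG per thread by repeatedly
    # jumping the state of a seeded MersenneTwister by 10^20 steps.
    function per_thread_rngs(seed::Integer = 0)
        rngs = Vector{MersenneTwister}(undef, nthreads())
        rngs[1] = MersenneTwister(seed)
        for i in 2:length(rngs)
            rngs[i] = Future.randjump(rngs[i-1], big(10)^20)
        end
        return rngs
    end

    # Usage inside a threaded loop: rand(rngs[threadid()]).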

But the Atomic regression is a serious issue. I guess it is fixable, so we simply have to wait. If not, we should probably perform the processing in batches. I will see what works best.

@mohamed82008
Owner

mohamed82008 commented Jul 11, 2018

Wow, the version with batches is doing really well: linear scaling. It is still allocating, but the speed is good. I also found that a good heuristic for the batch size is min(n, round(Int, 10*sqrt(n))). This is based on some basic trial and error, so there could be better rules out there. PR coming up next!

threaded tmapreduce: 4 threads
  0.024689 seconds (350 allocations: 6.000 KiB)
threaded tmapadd: 4 threads
  0.026095 seconds (342 allocations: 5.859 KiB)
unthreaded
  0.099201 seconds (18 allocations: 816 bytes)
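
To make the batching idea concrete, here is a rough sketch (hypothetical names, not the PR's actual code) of dynamic load balancing in batches: an atomic counter hands out fixed-size blocks of indices, so the scheduling cost is paid once per batch instead of once per element.

    using Base.Threads

    # Dynamic load balancing in batches: threads repeatedly claim the next
    # block of `batchsize` indices via an atomic counter and reduce it locally,
    # then the per-thread partial results are combined. Assumes `op` is
    # associative and commutative and `v0` is its identity, as discussed above.
    function tmapreduce_batched(f, op, v0, src::AbstractVector;
                                batchsize = min(length(src), round(Int, 10 * sqrt(length(src)))))
        n = length(src)
        next = Atomic{Int}(1)                    # first unclaimed index
        parts = fill(v0, nthreads())
        @threads for t in 1:nthreads()
            acc = v0
            while true
                lo = atomic_add!(next, batchsize)   # claim a batch, get its start
                lo > n && break
                hi = min(lo + batchsize - 1, n)
                for i in lo:hi
                    acc = op(acc, f(src[i]))
                end
            end
            parts[t] = acc
        end
        return reduce(op, parts; init = v0)
    end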
