
add tmapreduce #1

Closed
ChrisRackauckas opened this issue Mar 15, 2018 · 11 comments

Comments

@ChrisRackauckas

No description provided.

@bkamins
Collaborator

bkamins commented Mar 18, 2018

I have made a preliminary write-up of map-reduce. I will try to run comprehensive tests next week. The problem with threaded map-reduce is that we have to assume that the values returned by f together with op form an Abelian group (i.e. op is associative and commutative), and in general the results will not be reproducible (e.g. when Float64 addition is considered).
I have also created a special function for addition, which is faster than the general implementation, as I guess this is the most common use case.

There are no such issues with tmap!.
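
For reference, a minimal sketch of the kind of threaded map-reduce being discussed (not the package's actual implementation; the function name is hypothetical). It relies on the Abelian assumption above so that per-thread partial results can be combined in any order:

    using Base.Threads

    # Minimal sketch: split the input into one contiguous chunk per thread,
    # reduce each chunk independently, then combine the partial results.
    # Only valid if `op` is associative and commutative, and `v0` is assumed
    # to be the identity element of `op`; floating-point results may differ
    # from the serial mapreduce.
    function tmapreduce_sketch(f, op, v0, src::AbstractVector)
        isempty(src) && return v0
        chunks = collect(Iterators.partition(eachindex(src), cld(length(src), nthreads())))
        parts = fill(v0, length(chunks))
        @threads for t in 1:length(chunks)
            acc = v0
            for i in chunks[t]
                acc = op(acc, f(src[i]))
            end
            parts[t] = acc
        end
        return reduce(op, parts; init = v0)
    end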

@fisiognomico

Hi!
The work done so far seems very promising. I have been working on writing some additional tests to measure the overhead contribution of trandjump(). I would also like to help you with tmapreduce!(); I will send you something in the next few days.
Checking the commutativity of op over the values returned by f is a very hard problem, and it can be decided only in a few special cases ([link](https://link.springer.com/chapter/10.1007%2F978-3-662-46681-0_9)).
Julia's reduce.jl does not seem to cover the case where the operators are non-Abelian (which can be disregarded as user misbehaviour).
Anyway, I will try to contribute. I hope that this can be turned into a PR soon; in any case, tell me if you need any help!

@bkamins
Collaborator

bkamins commented May 26, 2018

Sure! As you can see in the code, I have tmapreduce and tmapadd (a special case that avoids locks). How would you see tmapreduce! differing? I do not see why the ! is needed: the input data do not get modified here, as opposed to tmap!, which overwrites the dst vector.

The only thing I am waiting for before moving forward is for Julia 0.7 to be out, to be sure I optimize against a stabilized target (especially since, as you probably know, the threading model in Julia is experimental https://discourse.julialang.org/t/future-of-base-threads/10440 and might change in the future).

Anyway, help would be appreciated. As I wrote in the other issue you commented on, I plan to talk about these things at JuliaCon 2018, so it would be great to work out best practices to share with everyone.

@mohamed82008
Owner

mohamed82008 commented Jul 11, 2018

I tried tmapreduce; it had a bug, so I fixed it, but I can't seem to get any speedup over the single-threaded mapreduce. The reason seems to be the closure bug triggered by the @threads macro, since the exact same code is inferred correctly without the outer threads loop.
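
For context, here is a hypothetical minimal example (not the package's code) of the kind of inference problem being described: when a local variable is reassigned inside the closure that @threads builds around the loop body, it gets boxed and the loop loses type stability.

    using Base.Threads

    # `total` is captured and reassigned inside the closure generated by
    # @threads, so it is boxed and inference fails. (It also has a data race;
    # this is only meant to show the inference issue.)
    function sum_boxed(a)
        total = 0.0
        @threads for i in eachindex(a)
            total += log(a[i])
        end
        return total
    end

    # A common workaround: accumulate into per-thread slots so no captured
    # variable is reassigned across the closure boundary.
    function sum_partials(a)
        parts = zeros(nthreads())
        @threads for i in eachindex(a)
            parts[threadid()] += log(a[i])
        end
        return sum(parts)
    end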

@mohamed82008
Owner

mohamed82008 commented Jul 11, 2018

tmapadd is properly inferred but still allocates a lot and is much slower than the unthreaded version. I will share my code in a branch.

@mohamed82008
Owner

mohamed82008 commented Jul 11, 2018

The code can be found in https://github.com/mohamed82008/KissThreading.jl/tree/fixmapreduce which is based on the PR #2.

@mohamed82008
Owner

On the other hand, a complete rework using static load division was made in https://github.com/mohamed82008/KissThreading.jl/tree/rework, and all of the tmap!, tmapreduce and tmapadd functions scale almost linearly with the number of threads.

I noticed that tmap! is still a bit slower than the simple threaded version, i.e. using @threads in front of the loop. And this static load division is also a tad slower than your implementation, which has better load balancing. I think this difference will be even greater if the load per task is not as even as it was in the test cases I used. I think one needs to strike a balance somewhere.
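
As an illustration, a rough sketch of what static load division means here (the function name and chunking are hypothetical, not the branch's actual code): the index range is split into one contiguous chunk per thread up front, so there is no scheduling overhead, at the cost of poor balancing when per-element work varies.

    using Base.Threads

    # Static load division: each thread gets one fixed contiguous chunk.
    # No coordination is needed after the split, but if some elements are much
    # more expensive than others, threads finish at very different times.
    function tmap_static!(f, dst::AbstractVector, src::AbstractVector)
        n = length(src)
        nt = nthreads()
        @threads for t in 1:nt
            lo = div((t - 1) * n, nt) + 1
            hi = div(t * n, nt)
            for i in lo:hi
                dst[i] = f(src[i])
            end
        end
        return dst
    end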

@mohamed82008
Owner

mohamed82008 commented Jul 11, 2018

So dynamic load balancing should be better if it works, but it is causing allocations in tmapreduce and tmapadd for reasons I don't quite get, and it is triggering the closure bug in tmapreduce for some reason.

However, the inference bug is not causing the performance hit, since the static version of tmapreduce also doesn't infer everything but is still faster than the unthreaded version, as it allocates much less. So the allocations are the problem here.

The following timings were generated from:

    using KissThreading   # provides tmapreduce and tmapadd

    n = 10000000
    a = rand(n)

    println("threaded tmapreduce: $(Threads.nthreads()) threads")
    tmapreduce(log, +, 0., a)           # warm-up call to exclude compilation time
    @time tmapreduce(log, +, 0., a)

    println("threaded tmapadd: $(Threads.nthreads()) threads")
    tmapadd(log, 0., a)                 # warm-up
    @time tmapadd(log, 0., a)

    println("unthreaded")
    mapreduce(log, +, a, init=0.)       # warm-up
    @time mapreduce(log, +, a, init=0.)

Static:

threaded tmapreduce: 4 threads
  0.034197 seconds (30 allocations: 1008 bytes)
threaded tmapadd: 4 threads
  0.034678 seconds (16 allocations: 384 bytes)
unthreaded
  0.106056 seconds (18 allocations: 816 bytes)

Dynamic:

threaded tmapreduce: 4 threads
  0.647810 seconds (6.33 M allocations: 93.911 MiB)
threaded tmapadd: 4 threads
  0.636600 seconds (5.87 M allocations: 86.418 MiB)
unthreaded
  0.104249 seconds (18 allocations: 816 bytes)

@mohamed82008
Owner

Relevant: JuliaLang/julia#27694

@bkamins
Collaborator

bkamins commented Jul 11, 2018

Thanks. I was waiting for 0.7 to stabilize to update the package and then register it. In particular, given the new randjump protocol, it is simpler to use an approach that is a modification of https://github.com/nacs-lab/yyc-data/blob/d082032d075070b133fe909c724ecc405e80526a/lib/NaCsCalc/src/utils.jl#L120-L142. I will work on it before JuliaCon 2018, as I am giving a talk using this code then.
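
For readers unfamiliar with the randjump idea, a rough sketch of the approach (a hypothetical helper, not the linked code or the package's implementation), assuming a Julia version where Future.randjump is available: each thread gets its own MersenneTwister whose state has been jumped far ahead of the previous one, so the per-thread streams do not overlap.

    using Random, Future
    using Base.Threads

    # Hypothetical sketch: build one independent RNG per thread by repeatedly
    # jumping the state of a seeded MersenneTwister by 10^20 steps.
    function per_thread_rngs(seed::Integer = 0)
        rngs = Vector{MersenneTwister}(undef, nthreads())
        rngs[1] = MersenneTwister(seed)
        for i in 2:length(rngs)
            rngs[i] = Future.randjump(rngs[i-1], big(10)^20)
        end
        return rngs
    end

    # Usage inside a threaded loop: rand(rngs[threadid()]).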

But the Atomic regression is a serious issue. I guess it is fixable, so we simply have to wait. If not, we should probably perform the processing in batches. I will see what works best.

@mohamed82008
Owner

mohamed82008 commented Jul 11, 2018

Wow, the version with batches is doing really well: linear scaling. It is still allocating, but the speed is good. I also found that a good heuristic for the batch size is min(n, round(Int, 10*sqrt(n))). This is based on some basic trial and error, so there could be better rules out there. PR coming up next!

threaded tmapreduce: 4 threads
  0.024689 seconds (350 allocations: 6.000 KiB)
threaded tmapadd: 4 threads
  0.026095 seconds (342 allocations: 5.859 KiB)
unthreaded
  0.099201 seconds (18 allocations: 816 bytes)
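
To make the batching idea concrete, here is a rough sketch (hypothetical names, not the PR's actual code) of dynamic load balancing in batches: an atomic counter hands out fixed-size blocks of indices, so the scheduling cost is paid once per batch instead of once per element.

    using Base.Threads

    # Dynamic load balancing in batches: threads repeatedly claim the next
    # block of `batchsize` indices via an atomic counter and reduce it locally,
    # then the per-thread partial results are combined. Assumes `op` is
    # associative and commutative and `v0` is its identity, as discussed above.
    function tmapreduce_batched(f, op, v0, src::AbstractVector;
                                batchsize = min(length(src), round(Int, 10 * sqrt(length(src)))))
        n = length(src)
        next = Atomic{Int}(1)                    # first unclaimed index
        parts = fill(v0, nthreads())
        @threads for t in 1:nthreads()
            acc = v0
            while true
                lo = atomic_add!(next, batchsize)   # claim a batch, get its start
                lo > n && break
                hi = min(lo + batchsize - 1, n)
                for i in lo:hi
                    acc = op(acc, f(src[i]))
                end
            end
            parts[t] = acc
        end
        return reduce(op, parts; init = v0)
    end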
