RFC: "for-loop" compliant @parallel for.... take 2 #20259

Closed
wants to merge 1 commit

Conversation

@amitmurthy (Contributor) commented Jan 26, 2017

Rework of #20094

This addresses some of the concerns with @parallel as discussed at #19578 in a different manner.

@parallel for differs from a regular for-loop by:

  • implementing reducer functionality
  • treating the last line in the body as the value to be reduced
  • returning before the loop completes execution (non-reducer case)
  • not updating local arrays: folks are used to looping over local arrays and expect them to be updated, but this works only with shared arrays.

This PR:

  • deprecates the reducer mode of @parallel for
  • provides a way for the user to explicitly specify accumulators and use them in the main body
  • allows waiting on the accumulator(s)

The usage is a bit more verbose, but there is much less scope for confusion or misplaced expectations.

@parallel reducer for x in unit_range
  body
end

will now be written as

acc = ParallelAccumulator(initial_value)
@parallel for x in unit_range
  body
  acc[] = reducer_f(acc[], iteration_value)
end
result = reduce(reducer_f, acc)
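
As a concrete instance (a sketch assuming the proposed API described above; the loop body and values are illustrative), summing squares over 1:100:

acc = ParallelAccumulator(0)      # initial value of the local accumulations
@parallel for x in 1:100
    acc[] = acc[] + x^2           # accumulate locally on each worker
end
result = reduce(+, acc)           # final reduction on the caller; yields 338350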

Multiple accumulators can be referenced:

a = ParallelAccumulator(0.0)    # Specify an initial value. 
b = ParallelAccumulator(1.0)     
@parallel for i in 1:N
    a[] += foo(i)               # 0-arg indexation syntax for set/get. Similar to Ref.
                                
    b[] = min(b[], bar(i))   # In the loop any value can be assigned.
                             # The last value is sent to the master node
end
reduced_a = reduce(+, a)        # explicit reduce call for final reduction                  
reduced_b = reduce(min, b) 

Updating shared arrays works as before; there is no need for ParallelAccumulators there. However, ParallelAccumulators can be used in multi-node scenarios which shared memory cannot address.

As before, the input range is partitioned across workers, local reductions are performed on each worker, and a final reduction runs on the caller.

I feel this syntax and loop behavior are more in line with a regular for-loop. Updating arrays from the body is still not allowed (unless they are shared, of course); ParallelAccumulators cover that need.

ParallelAccumulator can also be used outside of @parallel, as shown below:

acc = ParallelAccumulator{Int}()
@sync for p in workers()
    @spawnat p begin
        for i in 1:10
            acc[] += i    # local accumulation
        end
        push!(acc)           # explicit push back to caller. This is implicitly done in case of `@parallel`
    end
end
result = reduce(+, acc)

API:

  • New export : ParallelAccumulator
  • acc[] - set/get current accumulated value on the workers
  • reduce(op, acc) - final reduction on the caller
  • push!(accumulator) - used on workers in non-@parallel mode to explicitly push local reductions to the caller.

Open for discussion:

  • bikeshedding the ParallelAccumulator name
  • any other API/usage suggestions

Todo:

  • NEWS
  • manual update
  • docstrings update
  • deprecate.jl entry

base/exports.jl Outdated
@@ -80,6 +80,7 @@ export
ObjectIdDict,
OrdinalRange,
Pair,
ParallelAccumulator,
@tkelman (Contributor) commented Jan 26, 2017

should go in stdlib doc index if exported

(oh, already in the todo)

Contributor: doesn't need to be listed twice

@tkelman tkelman added deprecation This change introduces or involves a deprecation needs news A NEWS entry is required for this change parallelism Parallel or distributed computation labels Jan 26, 2017
@amitmurthy amitmurthy changed the title RFC/WIP: "for-loop" compliant @parallel for.... take 2 RFC: "for-loop" compliant @parallel for.... take 2 Jan 27, 2017
@amitmurthy (Contributor Author)

Ready for review. It would be good if a couple of folks in addition to @tkelman had a look.

@clarkfitzg (Contributor)

New Julia user here, coming from Python, R, C. Pardon if my questions are naive, just looking to understand.

@aviks mentioned pmap in #19578. Looks like @parallel for and ParallelAccumulator essentially provide an alternate syntax for operations that can be done with the familiar map and reduce. Although there would need to be a parallel mapreduce() to directly reduce without an intermediate result.

The docs mention:

Julia’s pmap() is designed for the case where each function call does a large amount of work. In contrast, @parallel for can handle situations where each iteration is tiny, perhaps merely summing two numbers.

Curious, why is this the case? If they basically do the same thing then is it possible to share implementation?

base/multi.jl Outdated
t = Task(()->remotecall_fetch(f, pid, reducer, R, first(chunks[idx]), last(chunks[idx])))
schedule(t)
for (pid, r) in splitrange(length(R), workers())
t = @schedule remotecall_fetch(f, pid, reducer, R, first(r), last(r))
Contributor: Does this do dynamic load balancing across workers?

Contributor Author: No. The range is statically partitioned once.
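
For illustration, the static partitioning amounts to something like the sketch below (a hypothetical helper, not the actual splitrange code in Base):

# Split 1:N into np roughly equal contiguous chunks, one per worker.
function static_chunks(N::Int, np::Int)
    each, extras = divrem(N, np)
    chunks = UnitRange{Int}[]
    lo = 1
    for i in 1:np
        hi = lo + each - 1 + (i <= extras ? 1 : 0)  # first `extras` chunks get one extra element
        push!(chunks, lo:hi)
        lo = hi + 1
    end
    return chunks
end

static_chunks(10, 4)    # [1:3, 4:6, 7:8, 9:10] - fixed once, no rebalancing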

@JeffBezanson (Member)

How is the performance of this?

@amitmurthy (Contributor Author)

How is the performance of this?

Terrible. I am ashamed to post numbers here. Will try to come up with an alternative quickly.

@amitmurthy (Contributor Author)

@clarkfitzg, a couple of reasons:

  • folks do find the for-loop syntax more natural and convenient
  • deprecating the reducer but retaining the distributed loop helps in working towards a single model combining all types of loop parallelism: multiple nodes, @threads for, @simd for, and GPUs too.
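
To illustrate the granularity difference versus pmap (a sketch in 0.5-era syntax; both compute the same sum of squares):

# pmap ships one call per element to a worker: fine when each call is heavy,
# wasteful when each iteration is tiny.
result_pmap = reduce(+, pmap(i -> i^2, 1:1000))

# @parallel statically splits the range into one chunk per worker and reduces
# locally before the final reduction on the caller: cheap per iteration.
result_par = @parallel (+) for i in 1:1000
    i^2
end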

@amitmurthy amitmurthy force-pushed the amitm/parfor2 branch 2 times, most recently from d0ac82d to 912e102 Compare January 31, 2017 10:27
@amitmurthy (Contributor Author)

Have pushed an update. Numbers looking reasonable now.

With an empty body on 0.5, 4 workers:

julia> function foo(n)
           @parallel (+) for i in 1:n
               1
           end
       end;
julia> @time foo(10^9)
  0.000914 seconds (579 allocations: 46.250 KB)
1000000000

julia> @time foo(10^9)
  0.000934 seconds (562 allocations: 45.047 KB)
1000000000

This PR:

julia> function foo(n)
           a1 = ParallelAccumulator(+, 0)
           @parallel for i in 1:10^n
               push!(a1, 1)
           end
           take!(a1)
       end
julia> @time foo(9)
  0.002706 seconds (1.74 k allocations: 76.625 KiB)
1000000000

julia> @time foo(9)
  0.002627 seconds (1.75 k allocations: 77.266 KiB)
1000000000

With a minimal compute in the loop body

On 0.5

julia> function foo(n)
           @parallel (+) for i in 1:n
               rand()
           end
       end
julia> @time foo(10^9)
  0.665094 seconds (557 allocations: 44.531 KB)
5.0000125154458153e8

julia> @time foo(10^9)
  0.684806 seconds (569 allocations: 45.031 KB)
4.9999967851712906e8

On this PR:

julia> function foo(n)
           a1 = ParallelAccumulator(+, 0.0)
           @parallel for i in 1:n
               push!(a1, rand())
           end
           take!(a1)
       end
julia> @time foo(10^9)
  1.011015 seconds (1.80 k allocations: 78.375 KiB)
5.000151295992906e8

julia> @time foo(10^9)
  1.001658 seconds (1.85 k allocations: 72.500 KiB)
4.999923846122731e8

Given that the numbers are for 10^9 iterations, and that real-world code will have meatier compute, the numbers look very reasonable and can probably be improved a bit further too.

@amitmurthy (Contributor Author)

FWIW, the ParallelAccumulator values can also be directly accessed via []. With

function foo(n)
    a1 = ParallelAccumulator(+, 0.0)
    @parallel for i in 1:n
        a1[] = a1[] + rand()
    end
    take!(a1)
end
foo(10);
@time foo(10^9)

the time is further improved to 0.78 seconds.

base/multi.jl Outdated
end
global pacc_registry
for pacc in get(pacc_registry, rrid, [])
push!(pacc)
Contributor: what is this accomplishing?

@tkelman (Contributor) left a review comment:

(only partway through this version, posting my comments so far)

base/multi.jl Outdated
with a final reduction on the calling process.
The loop is executed in parallel across all workers, with each worker executing a subset
of the range. The call waits for completion of all iterations on all workers before returning.
Any updates to variables outside the loop body is not reflected on the calling node.
Contributor: "Any updates ... are not"

base/multi.jl Outdated
Note that without a reducer function, `@parallel` executes asynchronously, i.e. it spawns
independent tasks on all available workers and returns immediately without waiting for
completion. To wait for completion, prefix the call with [`@sync`](@ref), like :
Example with shared arrays:
Contributor: see elsewhere for how examples are usually formatted - something like # SharedArray Example would be more consistent here

base/multi.jl Outdated
julia> c = 10;

julia> @parallel for i=1:4
a[i] = i + c;
Contributor: should make indent consistent with other doctests, and semicolon not needed here (after the end maybe, or show the return value of the @parallel)

base/multi.jl Outdated
loop = args[1]
elseif na==2
elseif na == 2
depwarn("@parallel with a reducer is deprecated. Use ParallelAccumulators for reduction.", Symbol("@parallel"))
Contributor: this should leave a TODO note in deprecated.jl

@tkelman (Contributor) commented Feb 2, 2017

I'll continue reviewing this, but given it's a bit of a slowdown relative to 0.5 I think we should hold off and not make this change for 0.6.

@StefanKarpinski StefanKarpinski removed this from the 0.6.0 milestone Feb 2, 2017
@amitmurthy (Contributor Author)

but given it's a bit of a slowdown relative to 0.5 I think we should hold off and not make this change for 0.6.

I am fine with not doing this in 0.6; however, my reason would be to get the interface right. The performance numbers do not concern me that much right now.

  1. This change is more flexible relative to what we have now. We can have any number of parallel accumulators, and they can be used outside of @parallel for-loops too.
  2. The major slowdown is seen only on a billion iterations of an empty for-body. This is not real world usage, with even a small compute the overhead of the new machinery will be rendered inconsequential.
  3. We are in a feature freeze; better to get the API correct right now. There will be a couple of weeks to further improve performance.
  4. With inputs from @shashi , I have further simplified the interface on my local branch. Unless there is a demand to get this in now, it can wait to be merged early on in the next release cycle.

@shashi (Contributor) commented Feb 3, 2017

I agree we should get this in with the right API for 0.6, performance improvements can come later.

@amitmurthy want to push the DRef thing here? or create a new PR?

@tkelman (Contributor) commented Feb 3, 2017

There are a lot of changes here and 0.6 feature freeze is already a month overdue. The discussion on the triage call is that there are more important things to focus on for getting 0.6 feature freeze and this can wait.

@StefanKarpinski (Member)

There's also the issue that people are actually using our parallel for loops. If we tank their performance, that's not really ok.

@amitmurthy (Contributor Author)

There's also the issue that people are actually using our parallel for loops. If we tank their performance, that's not really ok.

Of course it is not OK. However, it is very important to get some perspective here. For an empty billion iterations (adding 1 in every iteration, actually) distributed over 4 workers, the slowdown is from 0.0009 to 0.002 seconds; this is fixed distribution overhead.

For minimal compute (a rand() in every iteration), the slowdown in this PR is from 0.67 (Julia 0.5) to 0.78 seconds, which I hope to remove entirely once we freeze the API.

"tanking their performance" is a bit of a stretch.

@amitmurthy (Contributor Author)

The slowdown seen in the case of summing floats is essentially #20452

@amitmurthy (Contributor Author)

  • The "it's embarrassing" comment also mentioned a "fix" and the PR was updated accordingly and numbers posted. Two numbers are being discussed here:

  • One, an "empty" do-nothing case seeing a radical 2.5x slowdown. However, it is important to view that in the right perspective:

    • Net, the 2.5x slowdown translated to an additional 1 millisecond in distributed mode. The additional 1 millisecond does not bother me because a) the replacement is more flexible, b) can support multiple reducers in a single loop c) is general enough to be used independent of @parallel for and finally d) the 1 millisecond extra is mostly network/serialization stuff on single node which will be subsumed by actual computation in real-world scenarios. In multi-node, I would expect even this 2.5x / 1 millisecond difference to disappear.
  • The other number reported a 15-30% slowdown using a rand() call to simulate "small loop computation". That was unexpected and has since been tracked down to a compiler optimization issue. It is being tracked in Performance difference between local Ref (allocated once) and a local float #20452.

@StefanKarpinski (Member)

So there is no practical slowdown here?

@amitmurthy (Contributor Author)

Practically, in real-world usage, IMO, there will not be.

Also, @shashi and myself have discussed a slightly improved version of the API. I'll update this PR with that today.



"""
push!(pacc::ParallelAccumulator)
Contributor: 1-arg push! doesn't make sense from an API standpoint. This has more to do with the communication than a collection operation.

Contributor Author: Hmmm, trying to reuse an existing exported verb. send(pacc::ParallelAccumulator)?

@amitmurthy (Contributor Author)

Updated.

Usage is now:

a = ParallelAccumulator(0.0)    # Specify an initial value. 
b = ParallelAccumulator(1.0)     
@parallel for i in 1:N
    a[] += foo(i)               # 0-arg indexation syntax for set/get. Similar to Ref.
                                
    b[] = min(b[], bar(i))   # In the loop any value can be assigned.
                             # The last value is sent to the master node
end
reduced_a = reduce(+, a)        # explicit reduce call for final reduction                  
reduced_b = reduce(min, b)           

Timings (seconds), each for 1 billion @parallel iterations with 8 local workers:

Sum of     master    this PR
1.0        0.14      0.13
rand()     0.50      0.60
sqrt(i)    1.21      1.20

@shashi (Contributor) commented Feb 8, 2017

Does it make sense to rename ParallelAccumulator to DRef (as a distributed version of Ref)?

@StefanKarpinski (Member)

Since we spell out RemoteRef and such, wouldn't DistributedRef be better, despite being a bit of a mouthful?

@Sacha0 (Member) commented Feb 8, 2017

IIUC, this loop construct / the accumulator does not necessarily involve distribution or even parallelism, but rather merely concurrency? If so, perhaps referencing concurrency as opposed to distribution or parallelism would yield a more appropriate name? Ref. #20486 (comment). Best! Corrected misunderstanding: This accumulator is specifically for distributed computing. DistributedRef seems like a great name.

@amitmurthy (Contributor Author)

IIUC, this loop construct / the accumulator does not necessarily involve distribution or even parallelism, but rather merely concurrency?

No, it is distributed and parallel rather than just concurrent, going by the following definition.

"Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once." - Rob Pike (https://blog.golang.org/concurrency-is-not-parallelism)

DistributedRef sounds great.

@StefanKarpinski (Member)

I've always found that quote a bit unclear. I think of it this way:

  • concurrency: expressing that things could happen at the same time;
  • parallelism: things actually happening at the same time.

@tkelman (Contributor) commented Feb 9, 2017

Couldn't you use this API from async tasks without anything being distributed? AccumulatorRef maybe, if that's what it's for?

@StefanKarpinski (Member)

Or maybe SyncRef since the key behavior seems to be synchronization?

@amitmurthy (Contributor Author)

The accumulator type is designed to work with @parallel, which distributes across processes. When used with @parallel, local accumulations on the workers are pushed automatically to the calling node. While it can be used with local task-only parallelism, there are no benefits to doing so; any regular object or a regular Ref will do.

ParallelAccumulator best states its functionality: each worker is accumulating in parallel. DistributedRef captures its nature: a Ref which is distributed across workers, though you do have to explicitly execute a final reduce on the caller.

@amitmurthy (Contributor Author) commented Feb 9, 2017

SyncRef: no, there is no barrier-like functionality exposed here. Calling reduce waits to collect accumulated values from all workers; however, synchronization is not the intent.

@Sacha0 (Member) commented Feb 11, 2017

With new perspective (#20486 (comment)), cheers for DistributedRef or DistributedAccumulator. Best!

Pushes the locally accumulated value to the calling node. Must be called once on each worker
when a DistributedRef is used independent of a [`@parallel`](@ref) construct.
"""
push!(dref::DistributedRef) = put_nowait!(dref.chnl, (myid(), dref.value))
Contributor: push doesn't make sense, it's not a growing collection - send would be better

Contributor Author: I agree.

Int(rand(Bool))
acc = DistributedRef(0)
@parallel for i=1:200000000
acc[] += Int(rand(Bool))
Contributor: 4 space indent in code examples

@amitmurthy (Contributor Author)

Have renamed to DistributedRef.

If we are having this in 0.6, would like @JeffBezanson to review once before merging.

There is a partial overlap in the functionality (not the implementation, which is quite different) between a DistributedRef and a DArray{T,1,T}: https://github.com/JuliaParallel/DistributedArrays.jl#working-with-distributed-non-array-data.

DistributedRef is specifically optimized for use with an @parallel for loop. With a DArray{T,1,T}, creating the distributed vector, local accumulations, and the final reduction are distinct operations w.r.t. the messages being transported.

With a DistributedRef, the remote refs are created as part of the @parallel serialization, and local accumulations are sent back when the local loops terminate.
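
For intuition, a minimal sketch of what the caller-side final reduction could look like, assuming hypothetical field names (initial for the seed value, chnl for the RemoteChannel that push! writes to, pending for the set of worker ids yet to report):

function reduce_sketch(op, dref)
    acc = dref.initial
    while !isempty(dref.pending)
        pid, val = take!(dref.chnl)   # blocks until some worker pushes its local value
        delete!(dref.pending, pid)
        acc = op(acc, val)            # fold in that worker's local accumulation
    end
    return acc
end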

Mentioning it here because I believe at some point in the future, module Distributed, DistributedArrays and SharedArrays will be part of a single DistributedComputing package, at which time DistributedRef may be replaced with a version of DArray{T,1,T} that works efficiently with @parallel.

However, for now I would like to see this PR merged in 0.6. It addresses two issues: 1) deprecation of the reducer mode in @parallel, and 2) its replacement with a different style of reduction, namely explicitly dealing with a distributed ref.

@musm (Contributor) commented Feb 16, 2017

+1 for having this in 0.6. For consistency, should DArray be renamed to DistributedArray to parallel DistributedRef, or should it be DArray and DRef? Perhaps there is some other precedent for the DArray name I'm not aware of.

@shashi (Contributor) commented Feb 17, 2017

julia> x = DistributedRef(0)
DistributedRef{Int64}(0, 0, Set{Int64}(), RemoteChannel{Channel{Tuple}}(1, 1, 5), 0)

julia> @parallel for i=1:10
           x[] += i
       end

julia> reduce(+, x)
55

julia> reduce(+, x)
55

julia> @parallel for i=1:10
           x[] += i
       end

julia> reduce(+, x)
55

Would it be better for the second @parallel for to use the old accumulated values of x on the workers? i.e. the answer should be 110 here.

This change would make this abstraction much more powerful than it currently is. It could be used outside @parallel, for example with explicit remotecalls. A caveat is that we will need a function to free the values in the ref if a user wishes to do so. But I figure for most use cases, e.g. just adding up numbers, you don't really need to care about it. You need such a thing now anyway if you are accumulating big objects (although you can do empty!(x.value) after the reduce, that's not really API)... This should also only result in the deletion of some logic in the setup and teardown of @parallel.

@StefanKarpinski (Member)

We should keep in mind that we'll probably have some kind of SyncedRef type that can be safely updated by multiple threads (Cilk has this kind of thing), so we should make the naming scheme coherent across parallel and distributed models.

@ViralBShah (Member)

Perhaps too old to reuse. @shashi any help here?

@vtjnash (Member) commented Feb 10, 2024

Moved to JuliaLang/Distributed.jl#39

@vtjnash vtjnash closed this Feb 10, 2024
@vtjnash vtjnash deleted the amitm/parfor2 branch February 10, 2024 22:46