RFC: "for-loop" compliant @parallel for.... take 2 #20259
Conversation
base/exports.jl
Outdated
@@ -80,6 +80,7 @@ export
ObjectIdDict,
OrdinalRange,
Pair,
ParallelAccumulator,
should go in stdlib doc index if exported
(oh, already in the todo)
doesn't need to be listed twice
Force-pushed from 60ffb2c to 6aab891.
Ready for review. It would be good if a couple of folks in addition to @tkelman have a look.
New Julia user here, coming from Python, R, and C. Pardon if my questions are naive; I'm just looking to understand. @aviks mentioned this, and the docs mention it as well:
Curious, why is this the case? If they basically do the same thing, is it possible to share the implementation?
base/multi.jl
Outdated
t = Task(()->remotecall_fetch(f, pid, reducer, R, first(chunks[idx]), last(chunks[idx])))
schedule(t)
for (pid, r) in splitrange(length(R), workers())
t = @schedule remotecall_fetch(f, pid, reducer, R, first(r), last(r))
Does this do dynamic load balancing across workers?
No. The range is statically partitioned once.
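Roughly, static partitioning here means the range is divided into one contiguous chunk per worker, once, with no rebalancing afterwards. A sketch (an illustration, not Base's actual `splitrange`):

```julia
# Divide 1:n into np contiguous chunks of near-equal size.
function static_chunks(n::Int, np::Int)
    len, rem = divrem(n, np)
    chunks = UnitRange{Int}[]
    lo = 1
    for w in 1:np
        hi = lo + len - 1 + (w <= rem ? 1 : 0)   # spread the remainder
        push!(chunks, lo:hi)
        lo = hi + 1
    end
    chunks
end

static_chunks(10, 3)   # [1:4, 5:7, 8:10] - fixed once, no work stealing
```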
How is the performance of this?
Terrible. I am ashamed to post numbers here. Will try to come up with an alternative in quick-time.
@clarkfitzg, a couple of reasons:
Force-pushed from d0ac82d to 912e102.
Have pushed an update. Numbers are looking reasonable now.
With an empty body on 0.5, 4 workers:
This PR:
With a minimal compute in the loop body, on 0.5:
On this PR:
Given that the numbers are for 10^9 iterations, and that real-world code will have more meaty compute, the numbers look very reasonable and can probably be improved a bit further too.
FWIW, the `ParallelAccumulator` values can also be directly accessed; with that, the time is further improved to 0.78 seconds.
base/multi.jl
Outdated
end
global pacc_registry | ||
for pacc in get(pacc_registry, rrid, []) | ||
push!(pacc) |
what is this accomplishing?
(only partway through this version, posting my comments so far)
base/multi.jl
Outdated
with a final reduction on the calling process.
The loop is executed in parallel across all workers, with each worker executing a subset
of the range. The call waits for completion of all iterations on all workers before returning.
Any updates to variables outside the loop body is not reflected on the calling node.
"Any updates ... are not"
base/multi.jl
Outdated
Note that without a reducer function, `@parallel` executes asynchronously, i.e. it spawns
independent tasks on all available workers and returns immediately without waiting for
completion. To wait for completion, prefix the call with [`@sync`](@ref), like :
Example with shared arrays:
see elsewhere for how examples are usually formatted - something like `# SharedArray Example` would be more consistent here
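For reference, the pattern the removed doc text describes (pre-PR behavior; `do_work` is a placeholder):

```julia
# Without a reducer, @parallel only spawns tasks and returns immediately;
# @sync blocks until all the spawned work has finished.
@sync @parallel for i in 1:100
    do_work(i)    # placeholder loop body
end
```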
base/multi.jl
Outdated
julia> c = 10;

julia> @parallel for i=1:4
a[i] = i + c;
should make indent consistent with other doctests, and semicolon not needed here (after the `end` maybe, or show the return value of the `@parallel`)
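A possible cleaned-up doctest along those lines (the `SharedArray` construction and printed output are assumptions, not the PR's actual text):

```julia
julia> a = SharedArray{Int}(4);

julia> c = 10;

julia> @parallel for i = 1:4
           a[i] = i + c
       end;

julia> a
4-element SharedArray{Int64,1}:
 11
 12
 13
 14
```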
base/multi.jl
Outdated
loop = args[1]
elseif na==2
elseif na == 2
depwarn("@parallel with a reducer is deprecated. Use ParallelAccumulators for reduction.", Symbol("@parallel"))
this should leave a TODO note in deprecated.jl
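Something like the following note (wording assumed, not the actual commit):

```julia
# In base/deprecated.jl:
# TODO: remove the reducer branch of `@parallel` in base/multi.jl after
# the 0.6 deprecation period (#20259). The depwarn is emitted there at
# macroexpansion time, since `@deprecate` cannot express a macro deprecation.
```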
I'll continue reviewing this, but given it's a bit of a slowdown relative to the existing `@parallel`.
I am fine with not doing this in 0.6; however, my reason would be to get the interface right. The performance numbers do not concern me that much right now.
I agree we should get this in with the right API for 0.6; performance improvements can come later. @amitmurthy, want to push the DRef thing here, or create a new PR?
There are a lot of changes here, and the 0.6 feature freeze is already a month overdue. The discussion on the triage call was that there are more important things to focus on for the 0.6 feature freeze, and this can wait.
There's also the issue that people are actually using our parallel for loops. If we tank their performance, that's not really ok.
Of course it is not OK. However, it is very important to get some perspective here. For an empty billion iterations (actually adding 1 in every iteration) distributed over 4 workers, the slowdown is from 0.0009 to 0.002 seconds - this is fixed distribution overhead. For a minimal compute in the loop body, "tanking their performance" is a bit of a stretch.
The slowdown seen in the case of summing floats is essentially #20452.
So there is no practical slowdown here?
Practically, in real-world usage, IMO there will not be. Also, @shashi and I have discussed a slightly improved version of the API. I'll update this PR with that today.
Force-pushed from 912e102 to 13774c2.
base/parallel/macros.jl
Outdated
"""
    push!(pacc::ParallelAccumulator)
1-arg `push!` doesn't make sense from an API standpoint. This has more to do with the communication than a collection operation.
Hmmm, trying to reuse an existing exported verb. `send(pacc::ParallelAccumulator)`?
Updated. Usage is now:
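Presumably something along these lines (a sketch consistent with the API discussed in this thread):

```julia
acc = ParallelAccumulator(0)

@parallel for i in 1:10^9
    acc[] += 1          # worker-local update via acc[]
end

reduce(+, acc)          # final reduction on the caller
```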
Timings (seconds), each for 1 billion iterations:
Makes sense to rename.
Since we spell out
No, it is distributed and parallel rather than just concurrent, going by the following definition: "Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once." - Rob Pike (https://blog.golang.org/concurrency-is-not-parallelism)
I've always found that quote a bit unclear. I think of it this way:
Couldn't you use this API from async tasks without anything being distributed?
Or maybe
The accumulator type is designed to work with `@parallel`.
With new perspective (#20486 (comment)), cheers for `DistributedRef`.
Force-pushed from 50c2d81 to d296173.
Pushes the locally accumulated value to the calling node. Must be called once on each worker
when a DistributedRef is used independent of a [`@parallel`](@ref) construct.
"""
push!(dref::DistributedRef) = put_nowait!(dref.chnl, (myid(), dref.value))
`push!` doesn't make sense - it's not a growing collection; `send` would be better.
I agree.
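A minimal sketch of that rename (hypothetical; same body as the `push!` method above, with `send` as the verb):

```julia
# hypothetical rename per the review discussion; body unchanged
send(dref::DistributedRef) = put_nowait!(dref.chnl, (myid(), dref.value))
```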
Int(rand(Bool))
acc = DistributedRef(0)
@parallel for i=1:200000000
acc[] += Int(rand(Bool))
4 space indent in code examples
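The same snippet with the requested 4-space indent:

```julia
acc = DistributedRef(0)
@parallel for i = 1:200000000
    acc[] += Int(rand(Bool))
end
```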
Have renamed to `DistributedRef`. If we are having this in 0.6, I would like @JeffBezanson to review once before merging. There is a partial overlap in the functionality (not the implementation, which is quite different) between a `DistributedRef` and existing remote reference types.
Mentioning it here because I believe at some point in the future the parallel module may be restructured. However, for now I would like to see this PR merged in 0.6. It addresses two issues: 1) deprecation of the reducer mode in `@parallel`, and 2) an accumulator type covering the reduction use case.
+1 for having this in 0.6. For consistency, should DArray be renamed to DistributedArray to parallel DistributedRef, or should we have DArray and DRef? Perhaps there is some other precedent for the DArray name I'm not aware of.
julia> x = DistributedRef(0)
DistributedRef{Int64}(0, 0, Set{Int64}(), RemoteChannel{Channel{Tuple}}(1, 1, 5), 0)

julia> @parallel for i=1:10
           x[] += i
       end

julia> reduce(+, x)
55

julia> reduce(+, x)
55

julia> @parallel for i=1:10
           x[] += i
       end

julia> reduce(+, x)
55

Would it be better for the second `@parallel` loop to accumulate onto the existing value, so that the final `reduce` returns 110? This change would make this abstraction much more powerful than it currently is. It could be used outside `@parallel` as well.
We should keep in mind that we'll probably have some kind of
Perhaps too old to reuse. @shashi any help here?
Moved to JuliaLang/Distributed.jl#39
Rework of #20094
This addresses some of the concerns with `@parallel` as discussed at #19578, in a different manner. `@parallel for` differs from a regular for-loop by:

This PR:
`@parallel for`

The usage is a bit more verbose, but there is much less scope for confusion or misplaced expectations. What was previously written with a reducer will now be written as in the sketch below.
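Roughly, the transformation looks like this (a sketch based on the accumulator API summarized below; the loop body is only illustrative):

```julia
# Pre-PR reducer form (Julia 0.5 syntax):
s = @parallel (+) for i in 1:10^6
    rand(Int)
end

# The form proposed in this PR, using a ParallelAccumulator:
acc = ParallelAccumulator(0)
@parallel for i in 1:10^6
    acc[] += rand(Int)      # worker-local accumulation
end
s = reduce(+, acc)          # final reduction on the caller
```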
Multiple accumulators can be referenced in the same loop body.
Updating shared arrays works as before; there is no need for ParallelAccumulators. However, ParallelAccumulators can be used in multi-node scenarios, which shared memory cannot address.
As before, the input range is partitioned across workers, with local reductions performed and a final reduction on the caller.
I feel this syntax and loop behavior is more in line with a regular for-loop. Updating arrays from the body is still not allowed (except if they are shared, of course). ParallelAccumulators do cover that need.
`ParallelAccumulator` can also be used outside of a `@parallel` construct, as shown in the sketch following the API summary below.
API of `ParallelAccumulator`:
- `acc[]` - set/get the current accumulated value on the workers
- `reduce(op, acc)` - final reduction on the caller
- `push!(accumulator)` - used on workers in non-`@parallel` mode, to explicitly push local reductions to the caller
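A sketch of that standalone usage (assuming the accumulator can be shipped to workers with `remotecall_wait`, and that `push!` sends the local value back as the docstring describes):

```julia
addprocs(2)

acc = ParallelAccumulator(0)

# Without @parallel, each worker accumulates locally and must push its
# value back to the caller explicitly.
@sync for p in workers()
    @async remotecall_wait(p, acc) do a
        for i in 1:1000
            a[] += i        # worker-local update
        end
        push!(a)            # explicit push, required outside @parallel
    end
end

reduce(+, acc)              # combine the pushed per-worker values on the caller
```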
Further changes:
- the `ParallelAccumulator` name.

Todo: