avoid using `@sync_add` on remotecalls #44671

exaexa · 2022-03-18T16:42:50Z

It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously and serially (here: https://github.com/JuliaLang/julia/blob/v1.7.2/base/task.jl#L358). Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once.

For me, this closes #44645.

The major change can be illustrated as follows:

First add some workers:

using Distributed
addprocs(10)

and then trigger something that, for example, causes package imports on the workers:

using SomeTinyPackage

In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s.

This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running @everywhere for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks.

Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s.

Related issues:

Probably fixes @everywhere is slow on HPC with multi-node environment #39291.
reenable the precompile generation for Distributed #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high.

@sync

It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes #44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes #39291. - #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high.

exaexa · 2022-03-18T17:33:02Z

Side thought: The use of @sync_add in Distributed was actually the only use in whole Julia. Since it's also not exported from Base nor officially documented, it might be feasible to slash it.

 ~/work/julia $ grep sync_add * -r
base/task.jl:macro sync_add(expr)
 ~/work/julia $

exaexa · 2022-03-18T18:39:58Z

PS, illustrated relation to #42156:

 ~/work/julia $ time ./julia -p16 -e 'using Distributed; @everywhere 1+1'

# unpatched 1.7.2
real	0m27.972s
user	2m17.593s
sys	0m5.705s

# this PR
real	0m18.705s
user	1m38.069s
sys	0m4.570s

# 42156 only
real	0m10.502s
user	0m29.906s
sys	0m3.604s

# both PRs
real	0m6.800s
user	0m33.056s
sys	0m4.423s

jpsamaroo · 2022-03-18T19:28:36Z

@exaexa since I know you have access to some large node counts, can you validate that this actually alleviates #39291 (and show timing results here)?

exaexa · 2022-03-18T23:27:04Z

can you validate that this actually alleviates #39291 (and show timing results here)?

Sure, it's only going to take some time (need to wrap the whole thing in slurm and compile this branch on the HPC). Hopefully tomorrow. Feel free to suggest whatever benchmark in the meantime; now I think I'm doing just @everywhere 1+1 (because that's so simple that the slowdown hurts most) and @everywhere using JuMP (because that's literally what we need solved, ha).

exaexa · 2022-03-19T10:34:30Z

A slightly more rigorous benchmark

The script:

using Distributed, ClusterManagers
n_workers = parse(Int, ENV["SLURM_NTASKS"])
@info "starting" n_workers
t = @timed addprocs_slurm(n_workers, topology = :master_worker)
@info "finished!" t.time
t = @timed @everywhere 1+1     #alternatively: @timed @everywhere using JuMP
@info "1+1d" t.time

Slurm batch:

#!/bin/sh
#SBATCH -t 20
#SBATCH -c 1
#SBATCH -n 256   # or 1024
#SBATCH --mem-per-cpu 3G

JULIA=$HOME/work/julia-git/julia

time $JULIA testfile.jl

Hardware: cpu nodes on Iris cluster of Uni.Lu HPC, see https://hpc-docs.uni.lu/systems/iris/

Results:

version	test	256w `@info`	256w `time` wallclock	1024w `@info`	1024w `time` wallclock
1.7.2	1+1	2m14s	2m38s	9m43s	10m28s
`fda2c82`	1+1	0.58s	21s	0.66s	50s
1.7.2	using JuMP	3m26s	4m8s	11m1s	11m44s
`fda2c82`	using JuMP	11s	33s	13s	1m13s

I'd say this pretty much confirms the hypothesis.

I didn't measure #42156, but from what I've seen from testing locally, this roughly describes the differences:

1.7.2 takes roughly O(local_work + n_workers*(latency + long_precompile_time))
reenable the precompile generation for Distributed #42156 takes just O(local_work + n_workers * small_roundtrip_latency) (waaaaay faster but still not "scalable" per se)
this takes O(local_work + n_workers * negligible_packet_send_time + long_precompile_time) (the Distributed precompilation is present, but done in parallel way so it doesn't hurt at all. Also this is the most scalable we can get without reaching to complicated task spawning & result collection schemes)

vtjnash · 2022-03-21T04:06:37Z

SGTM. I think remotecall is supposed to be cheap, but makes sense we may need to launch a Task to handle the cases when it is not cheap.

Need to fix the error return type in the test though.

exaexa · 2022-03-21T07:07:30Z

SGTM. I think remotecall is supposed to be cheap, but makes sense we may need to launch a Task to handle the cases when it is not cheap.

Yeah remotecall itself is super-fast (I'd say it wouldn't even need to be asynchronous, except for dodging an extra tiny piece of network latency); the "expensive" action that we're dodging here is what gets triggered on the remote from @sync after you use @sync_add on the Future returned from the remotecall.

Need to fix the error return type in the test though.

You mean this one right? https://buildkite.com/julialang/julia-master/builds/10190#d6e3e440-c11b-46c5-9ab1-489cea189b10/361-912

(I got lost in which tests here are "expectably failing" and which ones should not fail)

Is there any good way to unwrap the right choice of TaskFailedExceptions? It seems to me that doing that properly would require adding some wrapper logic so that the Task gets started right away but it dodges the TaskFailedException wrap in sync_end... Alternatively I can catch the exceptions manually in remotecall_eval, mark them, and then rethrow them all correctly unwrapped after the sync is done.

exaexa · 2022-03-21T08:55:38Z

re exceptions, it seems there's no other way around than implementing a small unwrapping helper (I was pointed to a related issue here: #38931).

I'll push my attempt later today. :D

exaexa · 2022-03-21T09:31:51Z

This what I pushed is a "slightly less painful" way to do that, with less code duplication (still needs a bit of polishing). It is a bit suboptimal because it first creates the TaskFailedException and then forcibly unwraps it, while the "optimal" way would be to reimplement a lot of the Task to completely avoid this wrap-unwrap. I guess having less code and almost-negligible-cost wrap&unwrap is the better option. Opinions/recommendations welcome though.

~~Also seems to kinda solve #38931 .~~ almost, we'd need the same wrapper for @spawn.

base/task.jl

stdlib/Distributed/src/Distributed.jl

base/task.jl

May help with #38931

Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>

(as it was done originally with `@sync_add`)

tkf · 2022-03-23T05:50:40Z

Yeah remotecall itself is super-fast (I'd say it wouldn't even need to be asynchronous, except for dodging an extra tiny piece of network latency); the "expensive" action that we're dodging here is what gets triggered on the remote from @sync after you use @sync_add on the Future returned from the remotecall.

Maybe it's just me, but I find this description (and the one in the OP) confusing. Whatever happens with @sync_add remotecall(f, p) has to happen with @async unwrap remotecall_wait(f, p) for Distributed to function properly. Perhaps a more verbose way to describe this is that, since wait(::Future) requires several IOs (presumably), it is beneficial to interleave these IOs. For example, the primary process can start fetching from worker B before fetching from worker A completes. So, I find it easier to understand if the explanation is that we are increasing overall performance by interleaving expensive actions (IOs), and not by dodging anything. Does this make sense?

exaexa · 2022-03-23T08:42:14Z

@tkf yes, it seems that the second "dodging" in my explanation wasn't the best word choice. 😅

Just to make it perfectly clear,

the actions performed with and without this PR are the same
the main difference is that calls for wait() here are forcibly interleaved (by @async), which is not the case of naive wait()s in sync_end()
the main source of slowness here was that the first wait() caused compilation on remote side (>0.2s in my case). In consequence, calling these serially really hurt. (If Distributed got precompiled (reenable the precompile generation for Distributed #42156), the main source of slowness would mostly disappear, but the wait()s would still be serial (in my case ~0.05s per call) thus still a scalability problem.)

tkf · 2022-03-23T09:00:26Z

Thanks for clarification. I was also wondering if the main benefits are from the "pre-computation" I/O in remotecall or the "post-computation" I/O in fetch(::Future).

exaexa · 2022-03-23T10:22:45Z

🎉

Thanks everyone for comments&hints!

exaexa · 2022-03-23T10:44:01Z

PS @KristofferC is there any chance for backporting this to 1.6 or 1.7 branches? If so, how do I start it? (open PRs against release-1.[67] ?)

oscardssmith · 2022-03-23T12:59:37Z

imo, we don't really need to back-port to 1.7 since anyone using 1.7 will probably switch to 1.8 soon (and it's not a bugfix). That said, backporting to 1.6 probably makes sense since large clusters are more likely than most to stick to LTS.

jishnub · 2022-03-23T13:49:13Z

anyone using 1.7 will probably switch to 1.8 soon (and it's not a bugfix).

Is this being backported to 1.8? That would be nice. I don't see a label though

vtjnash · 2022-03-23T16:32:39Z

Now that I understand better, could you try just using Experimental.@sync here? I think that is written to avoid the problems you described.

exaexa · 2022-03-23T16:45:47Z

Now that I understand better, could you try just using Experimental.@sync here? I think that is written to avoid the problems you described.

Yeah it seems like that should work too, although it clearly shows some kind of event counting which I'm not a fan of (it's breakable). Also, if I read correctly, it's only able to pick up a single exception from the failing tasks, right?

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes #44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes #39291. - #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with #38931 Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com> (cherry picked from commit 62e0729)

giordano · 2022-04-20T09:29:10Z

Since there may be a new version in v1.7.x series I added the backport 1.7 label

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes #44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes #39291. - #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with #38931 Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com> (cherry picked from commit 62e0729)

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes #44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes #39291. - #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with #38931 Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com> (cherry picked from commit 62e0729)

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes #44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes #39291. - #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with #38931 Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com> (cherry picked from commit 62e0729)

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes #44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes #39291. - #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with #38931 Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com> (cherry picked from commit 62e0729)

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes #44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes #39291. - #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with #38931 Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com> (cherry picked from commit 62e0729)

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes #44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes #39291. - #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with #38931 Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com> (cherry picked from commit 62e0729)

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes JuliaLang/julia#44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see JuliaLang/julia#44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes JuliaLang/julia#39291. - JuliaLang/julia#42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with JuliaLang/julia#38931 Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com> (cherry picked from commit 3b57a49)

@sync

* avoid using `@sync_add` on remotecalls It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once. For me, this closes #44645. The major change can be illustrated as follows: First add some workers: ``` using Distributed addprocs(10) ``` and then trigger something that, for example, causes package imports on the workers: ``` using SomeTinyPackage ``` In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s. This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks. Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s. Related issues: - Probably fixes #39291. - #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high. May help with #38931 Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>

giordano added performance Must go faster parallelism Parallel or distributed computation labels Mar 18, 2022

This was referenced Mar 19, 2022

improve concurrency LCSB-BioCore/DistributedData.jl#40

Merged

use distributed processing from DiDa LCSB-BioCore/GigaSOM.jl#203

Merged

KristofferC reviewed Mar 21, 2022

View reviewed changes

base/task.jl Outdated Show resolved Hide resolved

giordano reviewed Mar 21, 2022

View reviewed changes

base/task.jl Outdated Show resolved Hide resolved

vchuravy reviewed Mar 21, 2022

View reviewed changes

base/task.jl Outdated Show resolved Hide resolved

vtjnash reviewed Mar 21, 2022

View reviewed changes

stdlib/Distributed/src/Distributed.jl Outdated Show resolved Hide resolved

jonas-schulze reviewed Mar 21, 2022

View reviewed changes

base/task.jl Outdated Show resolved Hide resolved

exaexa and others added 6 commits March 22, 2022 11:13

undo the wrapping of remotecall exceptions in TaskFailedException

e3bbc96

May help with #38931

simplify wait() and fetch() methods for UnwrapTaskFailedException

1507a95

avoid overusing macros

d6c3e41

fix comment english

6b51ade

Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>

rename @async_nowrap to more precise @async_unwrap

f4dc075

explicitly import @async_unwrap from Base

e4fb7f2

(as it was done originally with `@sync_add`)

KristofferC merged commit 62e0729 into JuliaLang:master Mar 23, 2022

exaexa deleted the mk-fix-serial-sync-add branch March 23, 2022 10:22

KristofferC added backport 1.8 Change should be backported to release-1.8 backport 1.6 Change should be backported to release-1.6 labels Mar 23, 2022

KristofferC mentioned this pull request Mar 25, 2022

Backports for 1.8-rc1/beta3 #44710

Merged

22 tasks

KristofferC removed the backport 1.8 Change should be backported to release-1.8 label Mar 29, 2022

giordano added the backport 1.7 label Apr 20, 2022

giordano mentioned this pull request Apr 20, 2022

Backports for 1.7.3 #44189

Merged

40 tasks

KristofferC mentioned this pull request May 16, 2022

release-1.6: Backports for 1.6.7 #45319

Merged

45 tasks

KristofferC removed the backport 1.6 Change should be backported to release-1.6 label Jul 6, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

avoid using `@sync_add` on remotecalls #44671

avoid using `@sync_add` on remotecalls #44671

exaexa commented Mar 18, 2022 •

edited

Loading

exaexa commented Mar 18, 2022

exaexa commented Mar 18, 2022 •

edited

Loading

jpsamaroo commented Mar 18, 2022

exaexa commented Mar 18, 2022

exaexa commented Mar 19, 2022 •

edited

Loading

vtjnash commented Mar 21, 2022 •

edited

Loading

exaexa commented Mar 21, 2022 •

edited

Loading

exaexa commented Mar 21, 2022

exaexa commented Mar 21, 2022 •

edited

Loading

tkf commented Mar 23, 2022

exaexa commented Mar 23, 2022

tkf commented Mar 23, 2022

exaexa commented Mar 23, 2022

exaexa commented Mar 23, 2022

oscardssmith commented Mar 23, 2022

jishnub commented Mar 23, 2022

vtjnash commented Mar 23, 2022

exaexa commented Mar 23, 2022

giordano commented Apr 20, 2022

avoid using @sync_add on remotecalls #44671

avoid using @sync_add on remotecalls #44671

Conversation

exaexa commented Mar 18, 2022 • edited Loading

exaexa commented Mar 18, 2022

exaexa commented Mar 18, 2022 • edited Loading

jpsamaroo commented Mar 18, 2022

exaexa commented Mar 18, 2022

exaexa commented Mar 19, 2022 • edited Loading

A slightly more rigorous benchmark

vtjnash commented Mar 21, 2022 • edited Loading

exaexa commented Mar 21, 2022 • edited Loading

exaexa commented Mar 21, 2022

exaexa commented Mar 21, 2022 • edited Loading

tkf commented Mar 23, 2022

exaexa commented Mar 23, 2022

tkf commented Mar 23, 2022

exaexa commented Mar 23, 2022

exaexa commented Mar 23, 2022

oscardssmith commented Mar 23, 2022

jishnub commented Mar 23, 2022

vtjnash commented Mar 23, 2022

exaexa commented Mar 23, 2022

giordano commented Apr 20, 2022

avoid using `@sync_add` on remotecalls #44671

avoid using `@sync_add` on remotecalls #44671

exaexa commented Mar 18, 2022 •

edited

Loading

exaexa commented Mar 18, 2022 •

edited

Loading

exaexa commented Mar 19, 2022 •

edited

Loading

vtjnash commented Mar 21, 2022 •

edited

Loading

exaexa commented Mar 21, 2022 •

edited

Loading

exaexa commented Mar 21, 2022 •

edited

Loading