
Unexplained slowness in @everywhere remotecalls with imports #44645

Closed
exaexa opened this issue Mar 16, 2022 · 2 comments · Fixed by #44671
Labels
parallelism Parallel or distributed computation

Comments

@exaexa
Contributor

exaexa commented Mar 16, 2022

Hello there!

So I spent some time debugging an issue similar to #42156 and perhaps #39291, and arrived at two possible sources of slowness:

  • @everywhere uses @sync_add (via remotecall_eval) to add Futures to a queue that gets wait()ed on at the end. Unfortunately, waiting for a Future seems to cause some additional processing on each remote worker, and that processing runs serially across workers. I managed to get fully parallel, fast imports by doing the waiting manually (code below); a toy model of the difference is sketched right after this list.
  • Understandably, @everywhere loads all imports locally first. That sounds like a good idea, but for an as-yet-unidentified reason (possibly this one) it again causes the "goes slowly through all workers" behavior for me.
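To make the first point concrete, here is a toy model of serial versus overlapped waiting (my own illustration, not the actual Distributed internals; the 0.3 s per-worker cost is made up):

```julia
using Distributed
addprocs(4)

# Each "acknowledgement" is a blocking round-trip to one worker; sleep(0.3)
# stands in for whatever processing the wait() triggers remotely.
ack(pid) = remotecall_fetch(() -> (sleep(0.3); myid()), pid)

@time foreach(ack, workers())    # serial round-trips: roughly 0.3 s × nworkers()
@time asyncmap(ack, workers())   # overlapped in tasks: roughly 0.3 s total
```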

To test it, I'm starting with:

using Distributed
addprocs(32)

If the program continues with

@time @everywhere workers() using Symbolics

I usually get something like 45 seconds (tested on a 16-core Dell XPS with plenty of RAM; the problem is not specific to Symbolics). The main issue is how this uses the CPU: first, Symbolics is imported locally (several seconds on 1 CPU), then there is the parallel part (~15 s with all CPUs working) doing the actual loading, then the part where the results are wait()ed for serially (another 15 s, with only 1 CPU working at a time), and finally a bit of extra parallel processing (roughly 10 seconds, for which I don't have any explanation).

The following code manages to import the package in around 17 seconds, and the CPU usage profile looks correctly "full" in this case, with no gaps:

@time asyncmap(fetch, [remotecall(()->Base.eval(Main, :(using Symbolics)), pid) for pid in workers()])

This doesn't import Symbolics locally, which should take about 3 seconds (see below), so it's still not an entirely fair comparison.

So, to make the "actions" comparable, I tried removing the line with the automatic local imports from @everywhere:

$(isempty(imps) ? nothing : Expr(:toplevel, imps...)) # run imports locally first

That shrank the time from around 45 to 33 seconds. Looking at the difference, I really wanted to understand what's so bad about the local import (it should cost 3 s, not 12 s!), so I tried adding it back using the asyncmap code.

Unfortunately, importing the package locally first throws the time back to ~45 seconds:

@time begin
    using Symbolics
    asyncmap(fetch, [remotecall(()->Base.eval(Main, :(using Symbolics)), pid) for pid in workers()])
end

...even though the time for importing Symbolics plus the time for running the asyncmap should add up to under 20 seconds:

julia> @time @eval using Symbolics
  3.233303 seconds (9.33 M allocations: 634.171 MiB, 1.71% gc time, 21.58% compilation time)

This also happens with other packages (e.g. JuMP), but Symbolics behaves particularly weirdly.

Questions:

  • Is there any explanation of what's happening?
  • If there is an explanation for why the local import slows things down, would a patch to @everywhere that replaces the @sync_add with asynchronous waiting for the futures be acceptable?
  • EDIT: an extra thought since yesterday: is there any good way to debug/profile/log what's actually happening and being processed on the workers? Obviously, throwing a @profile into the command doesn't reveal much. (One crude idea is sketched below.)
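As a crude sketch for that last question (my assumption, not an established workflow): have every worker time its own import and report the number back, so the per-worker cost at least becomes visible from the master.

```julia
using Distributed

# Pairs worker pid => seconds spent importing, measured on the worker itself.
timings = asyncmap(workers()) do pid
    pid => remotecall_fetch(pid) do
        @elapsed Base.eval(Main, :(using Symbolics))
    end
end
foreach(println, timings)
```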

Notably, fixing this would allow happy Julia parallel computing on thousands of worker nodes with just plain Distributed and no dirty tricks. I managed to test the asyncmap approach on an HPC cluster with 1024 workers; loading our packages there shrinks from ~350 seconds with @everywhere to a nice 10 seconds.

Thanks!
-mk

@giordano giordano added the parallelism Parallel or distributed computation label Mar 16, 2022
@exaexa
Contributor Author

exaexa commented Mar 17, 2022

Extra observation: simpler packages do not show the weird problem of the time increasing when a local `using` is around; it looks like it only affects packages that overload things from other packages, so it is likely closely related to precompile generation. (I still need to measure this precisely, though.)

The problem with the serial-ish @everywhere is still there (e.g., 9 vs 21 seconds with UnicodePlots).
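For reference, that comparison can be reproduced roughly like this (timings depend on the machine, and each variant should run in a fresh session):

```julia
using Distributed
addprocs(32)

# Goes through @everywhere's serial-ish waiting:
@time @everywhere workers() using UnicodePlots

# Manual, fully asynchronous variant:
@time asyncmap(fetch, [remotecall(() -> Base.eval(Main, :(using UnicodePlots)), pid) for pid in workers()])
```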

@exaexa
Contributor Author

exaexa commented Mar 18, 2022

OK, it turns out that importing packages after the workers are added actually causes some communication with the workers, which is visibly unoptimized (and in turn slow)...

This is a transcript from julia with debug print-outs added to Distributed's handle_msg:

julia> using Distributed

julia> addprocs(1)
1-element Vector{Int64}:
 2

julia>       From worker 2:	2 got header: Distributed.MsgHeader(Distributed.RRID(0, 0), Distributed.RRID(0, 0))
      From worker 2:	2 got msg T: Distributed.RemoteDoMsg
      From worker 2:	2 got msg: Distributed.RemoteDoMsg(Distributed.set_valid_processes, ([2],), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}())
julia> 

julia> using UnicodePlots
[ Info: Precompiling UnicodePlots [b8865327-cd53-5732-bb35-84acbb429228]
      From worker 2:	2 got header: Distributed.MsgHeader(Distributed.RRID(1, 3), Distributed.RRID(0, 0))
      From worker 2:	2 got msg T: Distributed.CallMsg{:call}
      From worker 2:	2 got msg: Distributed.CallMsg{:call}(Distributed.var"#1#2"{Base.PkgId}(UnicodePlots [b8865327-cd53-5732-bb35-84acbb429228]), (), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}())
      From worker 2:	2 got header: Distributed.MsgHeader(Distributed.RRID(0, 0), Distributed.RRID(1, 4))
      From worker 2:	2 got msg T: Distributed.CallMsg{:call_fetch}
      From worker 2:	2 got msg: Distributed.CallMsg{:call_fetch}(Distributed.wait_ref, (Distributed.RRID(1, 3), 1), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}())
1 got header: Distributed.MsgHeader(Distributed.RRID(1, 4), Distributed.RRID(0, 0))
1 got msg T: Distributed.ResultMsg
1 got msg: Distributed.ResultMsg(nothing)

julia> 
      From worker 2:	2 got header: Distributed.MsgHeader(Distributed.RRID(0, 0), Distributed.RRID(0, 0))
      From worker 2:	2 got msg T: Distributed.RemoteDoMsg
      From worker 2:	2 got msg: Distributed.RemoteDoMsg(exit, (), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}())

I was not able to find the source of that call in order to optimize it. I guess a fix similar to the one for @everywhere could help there.

EDIT: The workers do not seem to get any information about UnicodePlots when the package is loaded before the workers are added, which looks a bit fishy (unless it gets there through a different channel).
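One quick, illustrative way to check that suspicion is to list what a worker actually has loaded (Base.loaded_modules is internal, so this is a debugging hack, not an API):

```julia
using Distributed
# Names of all packages/modules currently loaded on worker 2.
remotecall_fetch(() -> sort!([pkg.name for pkg in keys(Base.loaded_modules)]), 2)
```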

EDIT2: Ah, found it, expect a patch. :D

KristofferC pushed a commit that referenced this issue Mar 23, 2022
* avoid using `@sync_add` on remotecalls

It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which
in turn calls wait() on all the futures synchronously. Not only is that
slightly detrimental for network operations (latencies add up), but in the case
of Distributed the call to wait() may actually cause some compilation on the
remote processes, which is also wait()ed for. As a result, some operations took
a large amount of "serial" processing time when executed on many workers at once.
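Illustration only (not the actual patch): the gist of the change is to stop
handing each Future to @sync's serial wait loop and instead wait for all of
them from concurrent tasks, so the per-worker wait costs overlap.

```
using Distributed

function eval_everywhere_sketch(m::Module, procs, ex)
    futures = [remotecall(Core.eval, pid, m, ex) for pid in procs]
    @sync for f in futures
        @async wait(f)   # waits overlap instead of running back to back
    end
    nothing
end

# e.g. eval_everywhere_sketch(Main, workers(), :(using UnicodePlots))
```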

For me, this closes #44645.

The major change can be illustrated as follows: First add some workers:

```
using Distributed
addprocs(10)
```

and then trigger something that, for example, causes package imports on the
workers:

```
using SomeTinyPackage
```

In my case (importing UnicodePlots on 10 workers), this improves the loading
time over 10 workers from ~11s to ~5.5s.

This is a far bigger issue when the worker count gets high. The processing time
on each worker is usually around 0.3 s, so triggering this problem even on a
relatively small cluster (64 workers) causes a really annoying delay, and
running `@everywhere` for the first time on reasonably sized clusters (I tested
with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks.

Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s,
and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't
bother to measure that precisely now, sorry) to ~11s.

Related issues:
- Probably fixes #39291.
- #42156 is kind of complementary -- it removes the most painful source of
  slowness (the 0.3 s precompilation on the workers), but the fact that the
  wait()ing is serial remains a problem if network latencies are high.

May help with #38931

Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>
KristofferC pushed a commit that referenced this issue Mar 25, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue Apr 20, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue May 23, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue May 23, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue Jul 4, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue Dec 21, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
staticfloat pushed a commit that referenced this issue Dec 23, 2022
* avoid using `@sync_add` on remotecalls (cherry picked from commit 62e0729)
vchuravy pushed a commit to JuliaLang/Distributed.jl that referenced this issue Oct 6, 2023
* avoid using `@sync_add` on remotecalls (cherry picked from commit 3b57a49)
Keno pushed a commit that referenced this issue Jun 5, 2024
* avoid using `@sync_add` on remotecalls