reenable the precompile generation for Distributed #42156

KristofferC · 2021-09-08T10:47:47Z

This was disabled in #37816 with the reason that latency for Distributed should not be crucial and that there was a lot of badly typed code in Distributed leading to invalidations in other code when this was precompiled into the sysimage.

However, it is becoming apparent that this had a significantly larger effect than intended, see #39291 and https://discourse.julialang.org/t/why-does-julia-use-thousands-of-cpu-hours-to-compute-1-2/66041.

Therefore, I think it is best to reenable this.

Some timings:

Master:

julia -p 64 -e 'using Distributed; @everywhere 1+1'  285.89s user 9.65s system 604% cpu 48.914 total

PR:

./julia -p 64 -e 'using Distributed; @everywhere 1+1'  29.68s user 2.21s system 174% cpu 18.256 total
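For a quick reading of the two `time` outputs above, the ratios can be worked out directly. This is only arithmetic on the quoted figures; the variable names are made up for illustration.

```julia
# Figures copied from the two `time` outputs quoted above (seconds).
master = (user = 285.89, wall = 48.914)
pr     = (user = 29.68,  wall = 18.256)

# Ratio of master to PR for wall-clock time and for user CPU time.
wall_speedup   = master.wall / pr.wall
user_cpu_ratio = master.user / pr.user

println("wall-clock speedup: ", round(wall_speedup, digits = 2), "x")
println("user CPU reduction: ", round(user_cpu_ratio, digits = 2), "x")
```

On these numbers the wall-clock time drops by a factor of about 2.7, while user CPU time drops by a factor of about 9.6.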

cc @moble, @vancleve, @algorithmx

timholy · 2021-09-08T13:21:14Z

It would be great if one of the heavy or affected users could take Distributed under a wing. Last I looked (and admittedly there have been a bunch of PRs since then...), Distributed had nice functionality but was a bit of a train wreck when it came to inferrability. Maybe it doesn't matter for performance, but maybe it does, and loading a few packages is pretty likely to invalidate big chunks of what gets precompiled here.

Still, if it solves real-world problems, no objections to putting this back. I just don't plan on fixing any of the invalidations myself 🙂.

GregPlowman · 2021-09-12T23:07:18Z

I use 200 workers across 5 machines (20 local + 4 x 45 remote).
Adding the 200 workers takes over 3 minutes (might be a different issue).
After the workers are added, loading with `@everywhere using` takes another 7-8 minutes.

KristofferC · 2021-09-13T09:31:52Z

With or without this PR? If without, could you try with this one?

GregPlowman · 2021-09-13T22:18:44Z

> With or without this PR? If without, could you try with this one?

Sorry, I should have given more info.
Times were for Julia v1.6.2 release version.
Faster loading on Julia 1.7.0-beta4, see comparison below.
Unfortunately and somewhat embarrassingly, I don't know how to test using this PR.

| | Julia 1.6.2 (seconds) | Julia 1.7.0-beta4 (seconds) | Julia 1.8.0-DEV with this PR (seconds) |
|---|---|---|---|
| Adding workers | 185 | 204 | 72 |
| Loading modules | 445 | 192 | 297 |
| Total | 630 | 396 | 369 |

Local machine running Windows 10. Remote machines are Windows Server 2019.

KristofferC · 2021-09-15T08:54:45Z

> Unfortunately and somewhat embarrassingly, I don't know how to test using this PR.

You can get it from: https://s3.amazonaws.com/julialangnightlies/assert_pretesting/winnt/x64/1.8/julia-12621148ff-win64.exe

jebej · 2021-09-15T13:06:23Z

Could we backport this to 1.7?

GregPlowman · 2021-09-16T03:29:47Z

> You can get it from: https://s3.amazonaws.com/julialangnightlies/assert_pretesting/winnt/x64/1.8/julia-12621148ff-win64.exe

Thanks Kristoffer.
I've updated the table in my previous post with times from this PR.
Adding workers is faster.
Loading modules is slower than 1.7.0-beta4.

ViralBShah · 2022-03-11T21:14:54Z

@KristofferC Should we get this merged?

timholy · 2022-03-12T15:07:18Z

There does not seem to be a lot of evidence that it helps. Sure, it's faster to add workers. But load just about any package and "poof," much of the compiled code in Distributed gets invalidated and has to be recompiled before the next operation can succeed. (The "total" line above is not obviously altered by this PR, though a benchmark on the same version with & without the PR would help clarify that.)

Until someone sits down and fixes the inference problems in Distributed, this will only add noise for anyone hunting for invalidations but who may not be a big user of Distributed. As well as make Julia slower to build.

I don't think there's any stdlib that needs this kind of attention as much as Distributed. REPL, Pkg, Tar, and others that used to have lots of inference problems have mostly been fixed.

vancleve · 2022-03-13T22:34:35Z

Agreed with @timholy here about the benefit of this change.

@timholy, do you think fixing the inference problems will fix this problem though? Based on my experience detailed here, #39291 (comment), it seems like the compile time is scaling with the number of workers (CPUs on the same or different nodes).

@moble suggests that it's because Julia is waiting for something or blocking after sending something to each worker (#39291 (comment)); is that true and, if so, is fixing that (making this process non-blocking) actually the solution here?

timholy · 2022-03-14T01:55:23Z

I haven't done a serious analysis of the causes of Distributed's slowness; all I've noticed is that there's a lot of poorly-inferred code that can, if precompiled, be invalidated by a wide swath of packages. How important that is (compared to other causes) remains to be determined.

avoid using `@sync_add` on remotecalls

It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which in turn calls wait() for all the futures synchronously. Not only that is slightly detrimental for network operations (latencies add up), but in case of Distributed the call to wait() may actually cause some compilation on remote processes, which is also wait()ed for. In result, some operations took a great amount of "serial" processing time if executed on many workers at once.

For me, this closes #44645.

The major change can be illustrated as follows. First add some workers:

```
using Distributed
addprocs(10)
```

and then trigger something that, for example, causes package imports on the workers:

```
using SomeTinyPackage
```

In my case (importing UnicodePlots on 10 workers), this improves the loading time over 10 workers from ~11s to ~5.5s.

This is a far bigger issue when worker count gets high. The time of the processing on each worker is usually around 0.3s, so triggering this problem even on a relatively small cluster (64 workers) causes a really annoying delay, and running `@everywhere` for the first time on reasonable clusters (I tested with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks.

Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s, and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't bother to measure that precisely now, sorry) to ~11s.

Related issues:
- Probably fixes #39291.
- #42156 is a kinda complementary -- it removes the most painful source of slowness (the 0.3s precompilation on the workers), but the fact that the wait()ing is serial remains a problem if the network latencies are high.

May help with #38931

Co-authored-by: Valentin Churavy <vchuravy@users.noreply.github.com>
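The serial-waiting effect described in this commit message can be sketched with a toy model that needs no workers at all. The sketch assumes that each wait costs a fixed delay of its own; `slow_wait`, `wait_serially` and `wait_concurrently` are hypothetical names for illustration, and none of this is the actual Distributed code or the actual patch.

```julia
# Toy model: `slow_wait` stands in for a wait() call that itself costs
# `delay` seconds (for example because it triggers work on a remote process).
# It is a hypothetical helper, not part of Distributed.
slow_wait(delay) = sleep(delay)

# Serial pattern: each wait finishes before the next one starts,
# so the total time grows roughly as n * delay.
function wait_serially(n, delay)
    for _ in 1:n
        slow_wait(delay)
    end
end

# Concurrent pattern: every wait runs in its own task and `@sync`
# blocks until all of them have finished, so the delays overlap.
function wait_concurrently(n, delay)
    @sync for _ in 1:n
        @async slow_wait(delay)
    end
end

n, delay = 10, 0.05
t_serial = @elapsed wait_serially(n, delay)
t_concurrent = @elapsed wait_concurrently(n, delay)
println("serial: ", round(t_serial, digits = 2), " s")
println("concurrent: ", round(t_concurrent, digits = 2), " s")

# Back-of-envelope estimate for the fully serial case, using the ~0.3 s
# per-worker figure quoted in the commit message.
for workers in (64, 1024)
    println(workers, " workers: about ", workers * 0.3, " s if fully serial")
end
```

Under this model, 1024 workers at 0.3 s each come to roughly 307 s, which is consistent with the "more than 5 minutes" reported above, and 64 workers come to roughly 19 s, the same order of magnitude as the reported ~30 s.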


vtjnash · 2024-02-11T00:34:17Z

This should be re-enabled over on JuliaLang/Distributed.jl#71 now instead


reenable the precompile generation for Distributed

1262114

KristofferC added the `parallelism` (Parallel or distributed computation) and `compiler:precompilation` (Precompilation of modules) labels Sep 8, 2021

moble mentioned this pull request Sep 9, 2021

@everywhere is slow on HPC with multi-node environment #39291

Closed

This was referenced Mar 16, 2022

Unexplained slowness in @everywhere remotecalls with imports #44645

Closed

avoid using @sync_add on remotecalls #44671

Merged

vtjnash mentioned this pull request Feb 11, 2024

need precompile statements re-enabled for addprocs (with PR) JuliaLang/Distributed.jl#71

Open

vtjnash closed this Feb 11, 2024

vtjnash deleted the kc/reenable_distributed_precompile branch February 11, 2024 00:34
