Parallelising and loop performance #7
Here is the code for a very simple example I used to test the performance of some different approaches to threading:
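The code snippet itself did not survive the export. As a hedged reconstruction (illustrative only, not the original code), a test of this kind would compare a plain serial loop against the same loop under `Threads.@threads`, with an elementwise `exp` standing in for the real kernel:

```julia
# Sketch of a serial-vs-threaded loop comparison; function names and the
# exp kernel are illustrative stand-ins, not the original test case.
using Base.Threads

function serial_fill!(b, a)
    for i in eachindex(a, b)
        b[i] = exp(a[i])      # stand-in for the expensive per-point work
    end
    return b
end

function threaded_fill!(b, a)
    @threads for i in eachindex(a, b)
        b[i] = exp(a[i])      # same kernel, loop iterations split over threads
    end
    return b
end

n = 10_000_000
a = rand(n)
b1 = similar(a)
b2 = similar(a)
@time serial_fill!(b1, a)     # timings depend on JULIA_NUM_THREADS
@time threaded_fill!(b2, a)
```

Run with e.g. `julia -t 24` (or `JULIA_NUM_THREADS=24`) so `@threads` actually has threads to use.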
The result of running this on Marconi is:
Update: I've been doing more testing on threading in various ways with Julia, and just had a chat with Joseph about the options. Thread-based parallelism the simple way (using `@threads`) works pretty well. I tried for a while to get something where the threads would be started at the beginning of the simulation and keep running the whole time - a sort of 'MPI-lite'. This needs some way to synchronise running threads though - OpenMP has one, but Julia doesn't, and I failed to write one that is efficient (see the so-far-unanswered Discourse question here). This has led us to think that MPI may be the best way to go after all. MPI does allow shared-memory arrays (within a node) - someone (I think Joseph or Peter Hill?) pointed me to that a while back. So we could do a hierarchy of parallelism using MPI:
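The node-level layer of such a hierarchy can be sketched with MPI.jl. This is an illustrative sketch, not code from the project, and the shared-window API in particular varies between MPI.jl releases, so check it against the installed version:

```julia
# Sketch: split MPI.COMM_WORLD into per-node sub-communicators so that
# ranks which can share memory end up together (MPI_COMM_TYPE_SHARED).
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

# All ranks on the same node land in the same node_comm.
node_comm = MPI.Comm_split_type(comm, MPI.COMM_TYPE_SHARED, rank)

# Within node_comm, MPI-3 shared-memory windows (Win_allocate_shared in
# MPI.jl; exact signature version-dependent) would give every rank on the
# node a view of the same array, while ordinary MPI messages handle the
# distributed-memory level between nodes.

MPI.Finalize()
```

Launch under `mpiexec -n <nprocs> julia script.jl`; running it without an MPI launcher gives a single-rank "node".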
I tried profiling some parallel runs, and I'm a bit confused... The profiler claims that a 24-core run reports ~86% of samples in the … I think it might be possible to reduce the number of …
What are you profiling with? Are the samples taken on all cores, or just one? I agree it's probably premature though!
Profiling with Julia's built-in sampling profiler, and using StatProfilerHTML.jl to visualise the results. Samples taken on all processes, and saved in separate directories (the developer actually added that feature this week because I requested it for this!). I checked the first and last processes.
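For reference, the workflow described (built-in sampling profiler plus StatProfilerHTML.jl) looks roughly like the following; the toy `work` function is purely illustrative:

```julia
# Sketch of the profiling workflow: sample with Base's Profile module,
# then render the samples as an HTML flame graph with StatProfilerHTML.jl.
using Profile
using StatProfilerHTML   # assumed installed; provides statprofilehtml()

function work()          # illustrative stand-in for the simulation
    s = 0.0
    for i in 1:10^7
        s += sin(i)
    end
    return s
end

work()                   # run once first so compilation is not profiled
@profile work()          # collect samples with the sampling profiler
statprofilehtml()        # write the HTML report for browsing
```

In a multi-process run each process collects its own samples, hence the per-process output directories mentioned above.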
I have been looking at documentation for Julia and some packages for thread-based parallelism and other loop optimisations. Opening this issue to keep notes...
- The `@threads` macro on a loop seems to work pretty well (test case and results posted in a comment below).
- `GPUArrays.jl`. Might be interesting in the long run for actual GPU usage. At one point it had a non-GPU implementation that used threads, but that has been removed from the main repo (they still use it for testing).
- `Strided.jl`. I haven't understood what it's really for yet - some kind of optimisation of array operations. Does support threading of broadcast operations, but for a trivial test case is slower than just using a threaded loop (as this is not its main purpose). Maybe worth looking at again and testing with multi-dimensional array operations.
- `LoopVectorization.jl`. This package is aimed at optimising loops using things like AVX instructions. I ran into a bug trying to use the `atan` function - `ERROR: LoadError: UndefVarError: tan_fast not defined` - and for a trivial test (similar to the one in the comment below but swapping `exp` in place of `tan`) didn't get better performance than just using `@threads`. There is a section on broadcasting in the `README.md` of `LoopVectorization.jl`: https://github.com/JuliaSIMD/LoopVectorization.jl/blob/master/README.md#broadcasting
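A minimal sketch of the kind of trivial `LoopVectorization.jl` test described above (assuming the `@avx` macro of that era; newer releases spell it `@turbo`, and the loop shown is illustrative, not the original benchmark):

```julia
# Sketch: vectorise a simple elementwise loop with LoopVectorization.
# exp worked in the tests above, whereas atan hit the tan_fast bug.
using LoopVectorization   # assumed installed; macro name is version-dependent

function avx_fill!(b, a)
    @avx for i in eachindex(a)
        b[i] = exp(a[i])
    end
    return b
end

a = rand(1000)
b = similar(a)
avx_fill!(b, a)
```

Timing this against the plain `@threads` version is how the "didn't get better performance" comparison above would be reproduced.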