
Parallelising and loop performance #7

Closed
johnomotani opened this issue Jun 23, 2021 · 7 comments · Fixed by #34

@johnomotani (Collaborator)

I have been looking at documentation for Julia and some packages for thread-based parallelism and other loop optimisations. Opening this issue to keep notes...

  • Just sticking the @threads macro on a loop seems to work pretty well (test case and results posted in a comment below).
    • Variations such as putting the work in a function or writing it out in full seem to make minimal difference.
  • I didn't find a nice solution for getting threading with array broadcasting. The thread "multi-threaded (@threads) dotcall/broadcast?" (JuliaLang/julia#19777) is unfortunately a bit old. Suggestions from it:
    • GPUArrays.jl. Might be interesting in the long run for actual GPU usage. At one point it had a non-GPU implementation that used threads, but that has been removed from the main repo (they still use it for testing).
    • Strided.jl. I haven't fully understood what it's for yet - some kind of optimisation of array operations. It does support threading of broadcast operations, but for a trivial test case it is slower than a plain threaded loop (threaded broadcasting is not its main purpose). Maybe worth looking at again with multi-dimensional array operations.
  • A comment on "Correct way to parallelize this code?" (Jutho/Strided.jl#9 (comment)) suggested looking at LoopVectorization.jl. That package aims to optimise loops using things like AVX instructions. I ran into a bug trying to use the tan function (ERROR: LoadError: UndefVarError: tan_fast not defined), and for a trivial test (similar to the one in the comment below, but with exp in place of tan) it was no faster than plain @threads. A sketch of the loop form is given after this list.
  • According to the README.md of LoopVectorization.jl (https://github.com/JuliaSIMD/LoopVectorization.jl/blob/master/README.md#broadcasting):

    Note that loops will be faster than broadcasting in general. This is because the behavior of broadcasts is determined by runtime information (i.e., dimensions other than the leading dimension of size 1 will be broadcasted; it is not known which these will be at compile time).
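
A minimal sketch of the LoopVectorization loop form referred to above (illustrative only: the macro was @avx at the time and is now @turbo, dostuff_turbo is a made-up name, and exp(3*x) is swapped in for tan(3*x) to avoid the tan_fast error):

using LoopVectorization: @turbo

# Sketch of the LoopVectorization variant of the test in the next comment;
# exp(3*x) replaces tan(3*x) because of the tan_fast bug mentioned above.
function dostuff_turbo(x)
    result = similar(x)
    @turbo for i ∈ eachindex(x)
        result[i] = sin(x[i]) + cos(x[i]) * exp(x[i]) - exp(x[i]^2) * sin(2*x[i]) + exp(3*x[i])
    end
    return result
end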

@johnomotani (Collaborator, Author)

Here is the code for a very simple example I used to test the performance of some different approaches to threading

# Uses the Strided package to get broadcasting with threads. See
# https://github.com/Jutho/Strided.jl.

using Base.Threads: @threads
using BenchmarkTools: @benchmark
using Random
using Strided

n = 100000

Random.seed!(1234)
const x = rand(Float64, n)


@inline f(x) = sin(x) + cos(x) * exp(x) - exp(x^2) * sin(2*x) + tan(3*x)
@inline f(x, i) = sin(x[i]) + cos(x[i]) * exp(x[i]) - exp(x[i]^2) * sin(2*x[i]) + tan(3*x[i])

function dostuff_broadcast(x)
    return @. f(x)
end

function dostuff_strided(x)
    return @strided @. f(x)
end

function dostuff_unsafe_strided(x)
    y = similar(x)
    @unsafe_strided y x @. y = f(x)
    return y
end

function dostuff_threaded(x)
    result = similar(x)
    @threads for i ∈ eachindex(x)
        result[i] = f(x[i])
    end
    return result
end

function dostuff_singleindex(x)
    result = similar(x)
    @threads for i ∈ eachindex(x)
        result[i] = f(x, i)
    end
    return result
end

function dostuff_inline(x)
    result = similar(x)
    @threads for i ∈ eachindex(x)
        result[i] = sin(x[i]) + cos(x[i]) * exp(x[i]) - exp(x[i]^2) * sin(2*x[i]) + tan(3*x[i])
    end
    return result
end


expected = dostuff_broadcast(x)


println("serial")
println("------")

result = @benchmark dostuff_broadcast($x)

@assert dostuff_broadcast(x) == expected

display(result)
println("")
println("")
println("")


println("strided broadcast")
println("-----------------")

result = @benchmark dostuff_strided($x)

@assert dostuff_strided(x) == expected

display(result)
println("")
println("")
println("")


println("unsafe_strided broadcast")
println("------------------------")

result = @benchmark dostuff_unsafe_strided($x)

@assert dostuff_unsafe_strided(x) == expected

display(result)
println("")
println("")
println("")


println("threaded by hand")
println("----------------")

result = @benchmark dostuff_threaded($x)

@assert dostuff_threaded(x) == expected

display(result)
println("")
println("")
println("")


println("single index")
println("------------")

result = @benchmark dostuff_singleindex($x)

@assert dostuff_singleindex(x) == expected

display(result)
println("")
println("")
println("")


println("threaded inline")
println("---------------")

result = @benchmark dostuff_inline($x)

@assert dostuff_inline(x) ≈ expected

display(result)
println("")
println("")
println("")

The result of running this on Marconi is

$ julia -O 3 --check-bounds=no -t 6 broadcast_test.jl
serial
------
BenchmarkTools.Trial: 
  memory estimate:  781.33 KiB
  allocs estimate:  2
  --------------
  minimum time:     6.294 ms (0.00% GC)
  median time:      6.316 ms (0.00% GC)
  mean time:        6.404 ms (0.33% GC)
  maximum time:     9.805 ms (34.21% GC)
  --------------
  samples:          780
  evals/sample:     1


strided broadcast
-----------------
BenchmarkTools.Trial: 
  memory estimate:  784.62 KiB
  allocs estimate:  65
  --------------
  minimum time:     1.685 ms (0.00% GC)
  median time:      2.079 ms (0.00% GC)
  mean time:        2.389 ms (0.66% GC)
  maximum time:     5.593 ms (31.35% GC)
  --------------
  samples:          2087
  evals/sample:     1


unsafe_strided broadcast
------------------------
BenchmarkTools.Trial: 
  memory estimate:  784.62 KiB
  allocs estimate:  65
  --------------
  minimum time:     1.642 ms (0.00% GC)
  median time:      2.049 ms (0.00% GC)
  mean time:        2.366 ms (0.82% GC)
  maximum time:     4.895 ms (0.00% GC)
  --------------
  samples:          2106
  evals/sample:     1


threaded by hand
----------------
BenchmarkTools.Trial: 
  memory estimate:  784.08 KiB
  allocs estimate:  33
  --------------
  minimum time:     1.079 ms (0.00% GC)
  median time:      1.115 ms (0.00% GC)
  mean time:        1.222 ms (0.69% GC)
  maximum time:     3.079 ms (0.00% GC)
  --------------
  samples:          4054
  evals/sample:     1


single index
------------
BenchmarkTools.Trial: 
  memory estimate:  784.08 KiB
  allocs estimate:  33
  --------------
  minimum time:     1.104 ms (0.00% GC)
  median time:      1.159 ms (0.00% GC)
  mean time:        1.291 ms (1.82% GC)
  maximum time:     4.741 ms (48.56% GC)
  --------------
  samples:          3836
  evals/sample:     1


threaded inline
---------------
BenchmarkTools.Trial: 
  memory estimate:  784.08 KiB
  allocs estimate:  33
  --------------
  minimum time:     1.105 ms (0.00% GC)
  median time:      1.355 ms (0.00% GC)
  mean time:        1.389 ms (0.67% GC)
  maximum time:     5.519 ms (0.00% GC)
  --------------
  samples:          3574
  evals/sample:     1


@johnomotani (Collaborator, Author)

Update: I've been doing more testing on threading in various ways with Julia, and just had a chat with Joseph about the options.

Thread-based parallelism the simple way (using @threads on a loop) has some overhead. I have the impression that Julia's threading is intended for task-based parallelism, where you set off a bunch of tasks that might take different amounts of time, but each task takes a relatively long time. The system is very flexible, which is great, but probably not really what we need.

I tried for a while to get something where the threads would be started at the beginning of the simulation and keep running the whole time - a sort of 'MPI-lite'. This needs some way to synchronise running threads, though; OpenMP has one, but Julia doesn't, and I failed to write one that is efficient (see my so-far-unanswered Discourse question).

This has led us to think that MPI may be the best way to go after all. MPI does allow shared-memory arrays (within a node) - someone (I think Joseph or Peter Hill?) pointed me to that a while back. So we could do a hierarchy of parallelism using MPI:

  1. Within a (NUMA-region of a) node have a group of processes working on shared-memory arrays.
    • If we ever want to use GPUs, set up so that there's one GPU for each group, and send work to it from rank-0 of the group.
  2. Communicate by halo-swaps between groups - probably do the communications just from rank-0 of each group, to minimise the number of messages.

MPI.jl does have support for shared arrays (although with minimal documentation). It's not super elegant (or Julian), but it should be fine for us, since we only need to allocate arrays at the beginning and keep them until the end of the simulation. We can hide the nasty bits (wrapping a raw pointer into a Julia array) in an allocation function. There are some MPI.Win objects that we have to keep around and call MPI.free() on when we're done with the arrays - I think it's probably best to stick them all in a global(-ish) list and then have some sort of finalize() function that we can call at the end of run_moment_kinetics(). I have a 'hello-world' version of the allocate/finalize functions working already.
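
A minimal sketch of what such allocate/finalize helpers could look like, assuming MPI.jl's shared-window API (MPI.Win_allocate_shared, MPI.Win_shared_query, MPI.free); the function names and the global window list are illustrative, and the exact signatures may differ between MPI.jl versions:

using MPI

MPI.Init()

const window_store = MPI.Win[]  # keep every shared window alive until the end

# Allocate an array of eltype T and size dims in shared memory, visible to
# every process in comm (which must live within one node).
function allocate_shared(T, dims, comm)
    n = prod(dims)
    # Let rank 0 provide all the memory; other ranks contribute zero elements.
    n_local = MPI.Comm_rank(comm) == 0 ? n : 0
    win, _ = MPI.Win_allocate_shared(Array{T}, n_local, comm)
    # Wrap a pointer to rank-0's buffer as a Julia Array on every rank.
    array = MPI.Win_shared_query(Array{T}, n, win; rank=0)
    push!(window_store, win)  # if win were freed or GC'd, the memory would go away
    return reshape(array, dims)
end

# Free all the shared windows, e.g. at the end of run_moment_kinetics().
function finalize_shared!()
    for win in window_store
        MPI.free(win)
    end
    empty!(window_store)
end

x = allocate_shared(Float64, (4, 8), MPI.COMM_WORLD)  # assumes a single node
# ... use x from all processes ...
finalize_shared!()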

@johnomotani changed the title from "Threading and loop performance" to "Parallelising and loop performance" on Sep 16, 2021
@johnomotani (Collaborator, Author) commented Nov 8, 2021

I've made some progress debugging, and the parallel version seems to be working now. I checked the strong-scaling performance for an nz=81, nvpa=241 grid (the 'sound wave' case in a periodic domain; cases 1-4 are finite difference with 0, 1, 2, or 3 moments evolved separately, and cases 5-8 are Chebyshev pseudospectral...).
[Figures: sound_wave_strong_scaling and sound_wave_efficiency plots]
This is just the time_advance!() function (i.e. ignoring initialization). It's not bad considering it's a fairly small grid - the run time does keep going down at least up to 32 cores. There are at least one or two sections of the code that are still serial - I didn't expect them to take much time, but maybe they become significant at large core counts. Also the load balance won't be perfect, since the number of processors does not exactly divide the number of grid points - I should work out what efficiency that means we should expect...

Todo before I make a PR:

  • Try profiling the code to see if there are any obvious bottlenecks (e.g. the serial sections).
  • Rebase the changes to a logical set of commits.
  • Fix the tests - seems to be some issue with MPI (configuration?) on the test servers at the moment.
  • Add a CI job that uses the new debugging code to check for race conditions.
  • Edit: also documentation...

@johnomotani (Collaborator, Author)

I tried profiling some parallel runs, and I'm a bit confused... For a 24-core run the profiler reports ~86% of samples in the MPI.Barrier() call, so I don't know how it's getting ~45% efficiency! And the 48-core run reports ~98% of samples in MPI.Barrier(). Unless the MPI calls really don't like being profiled for some reason... So I don't think I really understand what's going on, but it seems likely that the thing limiting the scaling is just the MPI.Barrier() calls (made from inside our block_synchronize() function, sketched below).

I think it might be possible to reduce the number of block_synchronize() calls (at least a bit) by re-ordering operations, but I think that's premature optimization at this point... The scaling will be better when we have 4d arrays!
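
For context, block_synchronize() in this scheme amounts to a barrier across the processes of a shared-memory block; a minimal sketch (the comm_block name for the block's communicator is an assumption):

using MPI

# Wait for every process in the shared-memory block, so that shared arrays
# written before the call can safely be read after it.
block_synchronize(comm_block::MPI.Comm) = MPI.Barrier(comm_block)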

@JosephThomasParker (Collaborator)

What are you profiling with? Are the samples taken on all cores, or just one?

I agree it's probably premature though!

@johnomotani (Collaborator, Author)

Profiling with Julia's built-in sampling profiler, and using StatProfilerHTML.jl to visualise the results.

Samples are taken on all processes and saved in separate directories (the developer actually added that feature this week because I requested it for this!). I checked the first and last processes.
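
A minimal sketch of this profiling workflow (the work() function is a stand-in for the real call, e.g. time_advance!(); as I understand the package's defaults, statprofilehtml() writes its report to a statprof/ directory):

using Profile
using StatProfilerHTML

# Stand-in workload; in practice this would be the function being profiled.
work() = sum(sin, rand(10^7))

work()            # run once first, so samples measure the work rather than compilation
Profile.clear()   # discard any previously collected samples
@profile work()
statprofilehtml() # render the samples as an HTML flame-graph report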

@johnomotani (Collaborator, Author) commented Nov 16, 2021

I think I'm nearly done now - PR coming shortly (fingers crossed!).

Scaling looks pretty similar to before (see below), but I fixed a couple of regressions in the serial performance:

  • apparently
    for is in composition.species_local_range, iz in z.outer_loop_range, ivpa in 1:vpa.n
        scratch[istage+1].pdf[ivpa,iz,is] = ...
    end
    
    is significantly slower than
    new_scratch = scratch[istage+1]
    for is in composition.species_local_range, iz in z.outer_loop_range, ivpa in 1:vpa.n
        new_scratch.pdf[ivpa,iz,is] = ...
    end
    
    I guess even though istage+1 doesn't change in the loop, the compiler can't absolutely guarantee that, so it has to do an extra pointer lookup or two in each loop iteration. We didn't have this issue before because the affected places were 'dot-broadcast' operations rather than explicit loops, which had the same effect (the scratch[istage+1] lookup was done only once).
  • The shared-memory arrays I'm using are created by MPI as 'shared window' objects. Each 'shared window' has its own communicator (a copy, not just a reference, even when lots of arrays are created with the same communicator), and I suspect that creating a large number of communicators slows things down (I don't have a clue why - maybe it uses a lot of memory, or makes some MPI lookup table large??). When trying to track down the slow-down, I got to a point where just creating the shared-memory arrays and keeping the 'shared window' objects stored in a Vector - without using the shared memory (I modified the array allocation function to just return a plain Julia array, Array{T}(undef, dims...)) and without calling any MPI functions - slowed down serial runs by about 10%. I still can't wrap my head around why. There had been a very large number of arrays, because advection_info structs contained shared arrays and we were creating arrays of advection_info structs. Adding a dimension to the arrays inside each advection_info struct reduces the number of shared arrays (at the cost of having to pass an extra index around in a few places), as there is now just one advection_info struct per species for each dimension. That was needed to get the CI tests to work (the MPI on GitHub's servers apparently only allows ~2000 communicators, and therefore a maximum of ~2000 shared-memory arrays), and by good luck it fixed this performance regression too.
    • Another possible workaround would be to create one large shared-memory allocation and then create many pointers into it, wrapping each in a Julia array, for all the shared-memory arrays that we need. I wouldn't suggest doing this for the moment because (a) it's more work and (b) it creates more places for a bug to be introduced. Just to write it down in case we ever need it: one way to keep the logic of how big a shared-memory pool to create relatively simple would be to have a wrapper array-like type
      mutable struct SharedMemoryArray{T,N} <: AbstractArray{T,N}
          # mutable so that `data` can later be swapped for a shared-memory-backed Array
          data::Array{T,N}
      end
      
      The wrapper types could be created on all processes, but with values initialized only on block_rank[] == 0, during the setup phase, and registered in a global list. Then at the beginning of time_advance!(), add up the sizes of all the arrays, allocate the shared-memory pool, create a pointer into the pool for each array, copy the data from rank-0 into the shared memory, and replace the data Array with an Array that wraps the pointer (see the sketch below). This solution would require twice as much memory as we actually need; a slightly more complicated version that separates the declaration and initialization steps could probably get rid of that.
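
A hypothetical sketch of that pool approach, under the same assumed MPI.jl shared-window API as the allocation sketch above (all names are illustrative, every array is assumed to hold Float64, and the AbstractArray interface methods like Base.size and Base.getindex would still need forwarding to the data field):

using MPI

const registered_arrays = SharedMemoryArray{Float64}[]  # filled during setup
const pool_window = Ref{MPI.Win}()  # keep alive; MPI.free() it at the end of the run

function create_shared_pool!(comm)
    total = sum(length(a.data) for a in registered_arrays)
    # One shared window for everything, so only one extra communicator is created.
    n_local = MPI.Comm_rank(comm) == 0 ? total : 0
    win, _ = MPI.Win_allocate_shared(Array{Float64}, n_local, comm)
    pool = MPI.Win_shared_query(Array{Float64}, total, win; rank=0)
    pool_window[] = win
    offset = 0
    for a in registered_arrays
        n = length(a.data)
        # Wrap a pointer into the pool as an Array with the original shape.
        shared = unsafe_wrap(Array, pointer(pool, offset + 1), size(a.data))
        if MPI.Comm_rank(comm) == 0
            shared .= a.data  # copy rank-0's initial values into the pool
        end
        a.data = shared       # swap in the shared-memory-backed Array
        offset += n
    end
    MPI.Barrier(comm)         # make the initialized values visible to all ranks
end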

Latest scaling plots (cyan line is the 'final' version; note that the green line that looks like it has higher efficiency in a couple of subplots is just because its serial run was slower!).
[Figures: sound_wave_strong_scaling and sound_wave_efficiency plots]

The efficiency is significantly better for a 'double-resolution' (4x the number of elements/grid points) simulation: 84% vs 74% on 4 processes, 64% vs 40% on 24, and 40% vs 22% on 48. That is hopefully a good sign for 2D2V simulations! Plots for the 'double resolution' case:
[Figures: sound_wave-2xres_strong_scaling and sound_wave-2xres_efficiency plots]
Not sure what's up with the high-process-count, Chebyshev-derivative cases!
