
Update threading example #760

Closed
KnutAM opened this issue Jul 9, 2023 · 12 comments · Fixed by #1068

@KnutAM
Member

KnutAM commented Jul 9, 2023

The threaded assembly in our example is not the recommended way to do things anymore (it is still correct though, AFAIU).

See PSA: Thread-local state is no longer recommended

One option could be to use ChunkSplitters.jl.

@Firimo

Firimo commented Aug 1, 2023

Hi,

one can use Julia's Channel type to store the ScratchValues objects from the threaded assembly example. This way, the issues with nthreads() and threadid() explained in the first footnote of the blog post you shared would be solved.

What follows is a sketch of the threaded assembly example using the design guidelines from the blog post, i.e. tasks shall not depend on threads. Using channels, the number of ScratchValues objects is independent of the number of threads (or tasks), because if the Channel is empty, a task will simply wait for the Channel to fill up again.

function doassemble(K::SparseMatrixCSC, colors, grid::Grid, dh::DofHandler, C::SymmetricTensor{4, dim}) where {dim}

    f = zeros(ndofs(dh))
    b = Vec{3}((0.0, 0.0, 0.0)) # Body force
    
    # the number of objects that should fill the channel
    n_obj = 8 # could also be Threads.nthreads(), or whatever

    # create the channel
    scratches = Channel{ScratchValues}(n_obj)

    # fill the channel (assuming the constructor only creates a single object)
    for _ in 1:n_obj
        put!(scratches, create_scratchvalue(K, f, dh))
    end

    for color in colors
        tasks = map(color) do cellid
            Threads.@spawn begin
                # take what you want
                scratch = take!(scratches)
                
                assemble_cell!(scratch, cellid, K, grid, dh, C, b)

                # but you have to give it back
                put!(scratches, scratch)
            end
        end

        # be patient! wait for all tasks to finish; otherwise your timing benchmarks will be marvelous, but you will not get the correct result
        wait.(tasks)
    end

    return K, f
end

Of course there is always the performance question. Here are some benchmarks from my own implementation. I used 8 threads with a cube of 8^3 linear elements, which translates to 2187 dofs. Sadly, GitHub screws up the formatting of the benchmark histograms. To me, the performance looks pretty much equivalent; this will not be a bottleneck for me.

This is the assembly using channels:

BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  332.687 ms … 432.290 ms  ┊ GC (min … max): 47.88% … 65.26%
 Time  (median):     353.018 ms               ┊ GC (median):    58.12%
 Time  (mean ± σ):   359.041 ms ±  22.796 ms  ┊ GC (mean ± σ):  58.05% ±  3.56%

  ▁       █ ▁██ ▁  ▁█ ▁                                       ▁
  █▁▁▁▁▁▁▁█▁███▁█▁▁██▁█▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁█ ▁
  333 ms           Histogram: frequency by time          432 ms <


 Memory estimate: 437.39 MiB, allocs estimate: 15408784.

This is the current assembly:

BenchmarkTools.Trial: 14 samples with 1 evaluation.
 Range (min … max):  168.072 ms … 403.001 ms  ┊ GC (min … max):  0.00% … 64.18%
 Time  (median):     380.939 ms               ┊ GC (median):    61.88%
 Time  (mean ± σ):   367.882 ms ±  58.063 ms  ┊ GC (mean ± σ):  60.23% ± 16.66%

                                                       ▁█▄  ▁
  ▆▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁███▆▁█▁▆ ▁
  168 ms           Histogram: frequency by time          403 ms <


 Memory estimate: 436.49 MiB, allocs estimate: 15404044.

Disregard this proposal as you please
Kind regards

@koehlerson
Member

If one compares the minimum, the performance is notably worse for some reason.

PS: you can post the benchmark as a "code" block that preserves the formatting. You just have to wrap it in three backticks (```) before and after.

@KnutAM
Member Author

KnutAM commented Aug 2, 2023

Thanks for your suggestion. Using Channels is a nice solution!
A problem is that the scaling is not very good for this approach (in my benchmarks); I believe this is because each put! and take! locks, so all other tasks have to wait if there is a queue.
One solution (that I use in other places) is an inner loop over a chunk of elements, without taking and putting from/to the Channel for each element. But this adds a bit of overhead code, which makes it less readable for users not so familiar with Julia/multithreading.
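A rough sketch of what such a chunked variant could look like (purely illustrative: it reuses the ScratchValues and create_scratchvalue helpers from the example above, and the chunk size is arbitrary):

```julia
# Sketch: amortize Channel locking by doing one take!/put! per chunk of cells
# instead of one per cell. Assumes ScratchValues, create_scratchvalue and
# assemble_cell! from the threaded assembly example.
function doassemble_chunked(K, f, colors, grid, dh, C, b; chunksize = 64)
    scratches = Channel{ScratchValues}(Threads.nthreads())
    for _ in 1:Threads.nthreads()
        put!(scratches, create_scratchvalue(K, f, dh))
    end
    for color in colors
        chunks = Iterators.partition(color, chunksize)
        tasks = map(chunks) do chunk
            Threads.@spawn begin
                scratch = take!(scratches)   # one take! per chunk ...
                for cellid in chunk          # ... then a plain serial loop
                    assemble_cell!(scratch, cellid, K, grid, dh, C, b)
                end
                put!(scratches, scratch)     # ... and one put! per chunk
            end
        end
        wait.(tasks)
    end
    return K, f
end
```

With, say, 64 cells per chunk, the Channel is locked two orders of magnitude less often than with per-cell take!/put!, at the cost of the extra inner loop.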

It is also possible to only spawn fewer tasks and let each task keep their own buffer, and then use chunks via the channels approach, see fredriks branch here: https://github.com/Ferrite-FEM/Ferrite.jl/blob/0e3f4d20faec741e5da05a2f9e60c67910b8ef3a/docs/src/literate/threaded_assembly.jl

One option would be to keep the "how-to" as it currently is using :static, but warn that it is not "the best way" and does not allow nested threaded tasks, and then provide something like what you have written, but with chunking (perhaps with ChunkSplitters) as an advanced threading example that we can refer to. Feel free to open a PR if you are interested in working on that! (It could also make sense to homogenize the two, so that most functions can be re-used, for example create_scratchvalues and all the setup.)

@Firimo

Firimo commented Aug 2, 2023

Hi,

so I benchmarked the assembly of a box with 8x8x8 linear hexahedral elements. The nice thing is that each color has exactly 64 cells (using the greedy algorithm). I used [1, 2, 4, 8, 16, 32, 64] threads. I do not understand why the time gets worse at 16 threads.

[figure: threaded_benchmark — assembly time vs. number of threads, 8x8x8 mesh]

@koehlerson
Member

Just want to highlight again that the metric maybe matters here:

> Our results suggest that using the minimum estimator for the true run time of a benchmark, rather than the mean or median, is robust to nonideal statistics and also provides the smallest error. Our model also revealed some behaviors that challenge conventional wisdom: simply running a benchmark for longer, or repeating its execution many times, can render the effects of external variation negligible, even as the error due to timer inaccuracy is amortized.

from https://arxiv.org/pdf/1608.04295.pdf

Which is already true for the first textual benchmarks you showed, so maybe it's even worse than that.

@KnutAM
Member Author

KnutAM commented Aug 2, 2023

> I do not understand why the time gets worse at 16 threads.

It would be interesting to see the results for a larger mesh: as you say, there are only 64 cells in each color, which gives only 4 cells per thread (for 16 threads). I would try 30x30x30, and perhaps just use the regular @elapsed, as the running time is long enough.
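For reference, a minimal timing sketch along those lines (illustrative only; it assumes the doassemble signature and setup functions from the threaded assembly example):

```julia
# Sketch: time a single assembly run with @elapsed instead of collecting
# full BenchmarkTools statistics; fine when one run takes long enough.
using Ferrite

grid = generate_grid(Hexahedron, (30, 30, 30))
colors = create_coloring(grid)
# ... set up dh, K, C as in the threaded assembly example ...
doassemble(K, colors, grid, dh, C)              # warm-up run (compilation)
t = @elapsed doassemble(K, colors, grid, dh, C) # the actual measurement
println("assembly with $(Threads.nthreads()) threads: $t s")
```

Launching this once per thread count (julia -t 1, -t 2, ...) gives the scaling curve directly.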

@Firimo

Firimo commented Aug 2, 2023

> Thanks for your suggestion. Using Channels is a nice solution! A problem is that the scaling is not very good (in my benchmarks) for this approach (I believe because each put! and take! locks so all other tasks have to wait if there is a queue). One solution (that I use in other places) is another loop, where you loop over a chunk of elements without taking and putting from/to the Channel.
>
> It is also possible to only spawn fewer tasks and let each task keep their own buffer, and then use chunks via the channels approach, see fredriks branch here: https://github.com/Ferrite-FEM/Ferrite.jl/blob/0e3f4d20faec741e5da05a2f9e60c67910b8ef3a/docs/src/literate/threaded_assembly.jl
>
> One option would be to keep the "how-to" as it currently is using :static, but warn that it is not "the best way". And then provide something as you have written, but with chunking (perhaps with ChunkSplitters) as an advanced threading example that we can refer to. Feel free to open a PR if you are interested in working on that!

Hi,

the chunk-loop approach sounds interesting and worth considering. Since my use case involves some heavy lifting in each element, a small overhead in the design of the threaded assembly is manageable. I just wanted to implement threaded assembly according to Julia design standards, inspired by this issue of yours.

Now considering the threaded assembly example: for a new user, any form of threading is way better than none, especially if it is simple. Channels get rid of the issues mentioned in the blog post you shared, without introducing a new dependency.

Many greetings

@fredrikekre
Member

I find this pattern pretty easy to understand too:

for color in colors
    @sync begin
        workqueue = Channel{Int}(Inf)
        foreach(x -> put!(workqueue, x), color)
        close(workqueue)
        ## Each color is safe to assemble threaded
        for _ in 1:n_threads # Threads.nthreads()
            Threads.@spawn begin
                local scratch = create_scratchbuffer(K, f, dh)
                for i in workqueue
                    assemble_cell!(scratch, i, grid, dh, C, b)
                end
            end
        end
    end
end

@Firimo

Firimo commented Aug 2, 2023

> Just want to highlight again that the metric maybe matters here:
>
> > Our results suggest that using the minimum estimator for the true run time of a benchmark, rather than the mean or median, is robust to nonideal statistics and also provides the smallest error. Our model also revealed some behaviors that challenge conventional wisdom: simply running a benchmark for longer, or repeating its execution many times, can render the effects of external variation negligible, even as the error due to timer inaccuracy is amortized.
>
> from https://arxiv.org/pdf/1608.04295.pdf
>
> Which is already true for the first textual benchmarks you showed, so maybe it's even worse than that.

This is the same benchmark as before, but using the minimum runtime instead of the median. This looks much nicer, I have to say.

[figure: threaded_benchmark_8x8x8_min — scaling benchmark, 8x8x8 mesh, minimum runtime]

@Firimo

Firimo commented Aug 2, 2023

> I do not understand why the time gets worse at 16 threads.
>
> Would be interesting to see the results for a larger mesh, since, as you say, there are only 64 cells in each color, this gives only 4 cells per thread (for 16). I would try 30x30x30, and perhaps just use the regular @elapsed as the running time is long enough.

Here comes the requested benchmark, this time with single-measurement statistics. It looks pretty awful.

[figure: threaded_benchmark_30x30x30 — scaling benchmark, 30x30x30 mesh]

@Firimo

Firimo commented Aug 2, 2023

However, the degree of awfulness in runtime is a moot point, since both versions (using :static and using channels, respectively) behave similarly. My proposal for threaded assembly is thus finalized. Decide however you want; I simply felt like reporting a solution, since this thread made me think about my own implementation.

Thanks to everyone partaking in the discussion.

@termi-official
Member

Another thing that came to mind was whether it might be beneficial to use atomics instead of the coloring.
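For reference, a rough sketch of what an atomics-based scatter could look like, using Atomix.jl. This is purely illustrative: find_nz_index is a hypothetical helper that locates the position of entry (I, J) in the sparse matrix storage, which a real assembler would implement via the CSC column/rowval lookup.

```julia
# Sketch: scatter element contributions with atomic adds instead of relying
# on coloring to avoid write conflicts. Hypothetical helper: find_nz_index.
using Atomix
using SparseArrays

function assemble_atomic!(K::SparseMatrixCSC, f::Vector, dofs, ke, fe)
    for (j, J) in pairs(dofs), (i, I) in pairs(dofs)
        idx = find_nz_index(K, I, J)      # position of (I, J) in K.nzval
        Atomix.@atomic K.nzval[idx] += ke[i, j]
    end
    for (i, I) in pairs(dofs)
        Atomix.@atomic f[I] += fe[i]
    end
end
```

Whether this beats coloring likely depends on how contended the shared dofs are; the atomic adds cost something on every write, while coloring pays once up front.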

@KnutAM KnutAM linked a pull request Sep 25, 2024 that will close this issue