Update threading example #760
Hi, what follows is a sketch of the threaded assembly example when one uses the design guidelines from the blog post, i.e. tasks shall not depend on threads. Using channels, the number of scratch buffers does not need to match the number of threads.
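The sketch itself did not survive in this thread, so here is a hedged reconstruction of the channel pattern. `ScratchData` and `assemble_cell!` are placeholders for the per-cell workspace and element routine, not actual Ferrite API:

```julia
using Base.Threads

# Placeholder per-task workspace (element matrix/vector, cell values, ...).
struct ScratchData
    ke::Matrix{Float64}
    fe::Vector{Float64}
end

# Channel-based assembly over one color: buffers are borrowed from a channel,
# so tasks never depend on a particular thread (no threadid() indexing).
function assemble_color!(assemble_cell!, cells_in_color, ndofs; ntasks = nthreads())
    buffers = Channel{ScratchData}(ntasks)
    for _ in 1:ntasks
        put!(buffers, ScratchData(zeros(ndofs, ndofs), zeros(ndofs)))
    end
    @sync for cell in cells_in_color
        @spawn begin
            scratch = take!(buffers)           # borrow a buffer ...
            try
                assemble_cell!(scratch, cell)  # ... do the element work ...
            finally
                put!(buffers, scratch)         # ... and hand it back
            end
        end
    end
    return nothing
end
```

Since all cells in one color touch disjoint global entries, the per-cell tasks can write their results without further synchronization.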
Of course there is always the performance question. Here are some benchmarks from my own implementation. I used 8 threads with a cube of 8^3 linear elements, which translates to 2187 dofs. This is the assembly using channels:
This is the current assembly:
Disregard this proposal as you please.
If one compares the minimum, the performance is notably worse for some reason. PS: you can post the benchmark as a code block that preserves the formatting; you just have to wrap it in three of those ``` before and after.
Thanks for your suggestion. Using channels is a nice solution! It is also possible to spawn fewer tasks and let each task keep its own buffer, and then use chunks via the channels approach, see Fredrik's branch here: https://github.com/Ferrite-FEM/Ferrite.jl/blob/0e3f4d20faec741e5da05a2f9e60c67910b8ef3a/docs/src/literate/threaded_assembly.jl One option would be to keep the "how-to" as it currently is using …
Just want to highlight again that the choice of metric may matter here (see https://arxiv.org/pdf/1608.04295.pdf), which is already true for the first textual benchmarks you showed, so maybe it's even worse than that.
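As a side note on the estimators themselves, BenchmarkTools exposes both directly; a minimal illustration, with a placeholder workload standing in for the assembly call:

```julia
using BenchmarkTools

# Placeholder workload standing in for the assembly call.
work() = sum(abs2, rand(10_000))

trial = @benchmark work()
minimum(trial)  # minimum runtime, the estimator recommended in the linked paper
median(trial)   # median runtime, more sensitive to system noise
```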
Would be interesting to see the results for a larger mesh: as you say, there are only 64 cells in each color, which gives only 4 cells per thread (for 16). I would try 30x30x30, and perhaps just use the regular …
Hi, the chunk-loop approach sounds interesting and worth considering. Since my use case involves some heavy lifting in each element, a small overhead in the design of the threaded assembly is manageable. I just wanted to implement threaded assembly according to Julia design standards, inspired by this issue of yours. Now considering the threaded assembly example: for a new user, any form of threading is way better than none, especially if it is simple. Channels get rid of the issues mentioned in the blog post you shared, without introducing a new dependency. Many greetings
I find this pattern pretty easy to understand too: Ferrite.jl/docs/src/literate/threaded_assembly.jl, lines 127 to 142 at 0e3f4d2 (sketched below).
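The embedded snippet does not render here; roughly, that pattern spawns a fixed number of tasks per color, each with its own scratch buffer and a contiguous chunk of cells. A sketch using the same placeholder names as above, not the exact file contents:

```julia
using Base.Threads

# Chunk-per-task variant: a handful of tasks per color, each creates its own
# buffer once and then loops over its slice of the cells.
function assemble_color_chunked!(assemble_cell!, make_scratch, cells_in_color;
                                 ntasks = nthreads())
    chunk_size = max(1, cld(length(cells_in_color), ntasks))
    @sync for chunk in Iterators.partition(cells_in_color, chunk_size)
        @spawn begin
            scratch = make_scratch()   # task-local buffer, no channel needed
            for cell in chunk
                assemble_cell!(scratch, cell)
            end
        end
    end
    return nothing
end
```

Compared to the one-task-per-cell channel version this spawns far fewer tasks, at the cost of a coarser load balance within each color.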
This is the same benchmark as before, but using the minimum runtime instead of the median. This looks much nicer, I have to say.
Here comes the requested benchmark, this time with the single-measurement statistics. Looks pretty awful.
However, the degree of awfulness in runtime is a moot point, since both versions (using :static and using channels, respectively) behave similarly. My proposal for threaded assembly is thus finalized. Decide however you want; I simply felt like reporting a solution, since this thread made me think about my own implementation. Thanks to everyone partaking in the discussion.
Another thing that came to mind was whether it might be beneficial to use atomics instead of the coloring.
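To spell out what that could mean, here is a rough sketch of accumulating into a global vector with atomic adds instead of relying on coloring. It assumes Atomix.jl's `@atomic` on array elements, and `cell_contributions` is a made-up element routine:

```julia
using Atomix, Base.Threads

# Atomic accumulation: cells may share dofs, but each += is race-free,
# so no coloring of the grid is required.
function accumulate_atomic!(f::Vector{Float64}, cells)
    @threads for cell in cells
        for (dof, val) in cell_contributions(cell)  # made-up element routine
            Atomix.@atomic f[dof] += val
        end
    end
    return f
end
```

Whether this beats coloring presumably depends on how much contention there is on shared dofs.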
The threaded assembly in our example is not the recommended way to do things anymore (it is still correct though, AFAIU).
See PSA: Thread-local state is no longer recommended
One option could be to use ChunkSplitters.jl.
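A minimal sketch of what that could look like, assuming `chunks(x; n = ...)` yields index ranges (the exact API has shifted between ChunkSplitters versions) and with a simple reduction standing in for the assembly work:

```julia
using ChunkSplitters, Base.Threads

# One task per chunk, each with its own accumulator; nothing is keyed on threadid().
function threaded_sum_of_squares(data; ntasks = nthreads())
    tasks = map(chunks(data; n = ntasks)) do idxs
        @spawn begin
            acc = 0.0
            for i in idxs
                acc += data[i]^2
            end
            acc
        end
    end
    return sum(fetch, tasks)
end
```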