Update threading example #760
Hi, what follows is a sketch of the threaded assembly example when one uses the design guidelines from the blog post, i.e. tasks shall not depend on threads. Using channels, the number of scratch buffers does not need to match the number of threads.
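The sketch itself did not survive in this thread, so here is a hedged reconstruction of the channel pattern. `ScratchData` and `assemble_cell!` are placeholders for the per-cell workspace and element routine, not actual Ferrite API:

```julia
using Base.Threads

# Placeholder per-task workspace (element matrix/vector, cell values, ...).
struct ScratchData
    ke::Matrix{Float64}
    fe::Vector{Float64}
end

# Channel-based assembly over one color: buffers are borrowed from a channel,
# so tasks never depend on a particular thread (no threadid() indexing).
function assemble_color!(assemble_cell!, cells_in_color, ndofs; ntasks = nthreads())
    buffers = Channel{ScratchData}(ntasks)
    for _ in 1:ntasks
        put!(buffers, ScratchData(zeros(ndofs, ndofs), zeros(ndofs)))
    end
    @sync for cell in cells_in_color
        @spawn begin
            scratch = take!(buffers)           # borrow a buffer ...
            try
                assemble_cell!(scratch, cell)  # ... do the element work ...
            finally
                put!(buffers, scratch)         # ... and hand it back
            end
        end
    end
    return nothing
end
```

Since all cells in one color touch disjoint global entries, the per-cell tasks can write their results without further synchronization.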
Of course there is always the performance question. Here are some benchmarks from my own implementation. I used 8 threads with a cube of 8^3 linear elements, which translates to 2187 dofs. This is the assembly using channels:
This is the current assembly:
Disregard this proposal as you please.
If one compares the minimum, the performance is notably worse for some reason. PS: you can post the benchmark as a code block that preserves the formatting; you just have to wrap it in three of those ``` before and after.
Thanks for your suggestion. Using channels is a nice solution! It is also possible to spawn fewer tasks and let each task keep its own buffer, and then use chunks via the channels approach, see Fredrik's branch here: https://github.com/Ferrite-FEM/Ferrite.jl/blob/0e3f4d20faec741e5da05a2f9e60c67910b8ef3a/docs/src/literate/threaded_assembly.jl One option would be to keep the "how-to" as it currently is using …
Just want to highlight again that the choice of metric may matter here (see https://arxiv.org/pdf/1608.04295.pdf), which is already true for the first textual benchmarks you showed, so maybe it's even worse than that.
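As a side note on the estimators themselves, BenchmarkTools exposes both directly; a minimal illustration, with a placeholder workload standing in for the assembly call:

```julia
using BenchmarkTools

# Placeholder workload standing in for the assembly call.
work() = sum(abs2, rand(10_000))

trial = @benchmark work()
minimum(trial)  # minimum runtime, the estimator recommended in the linked paper
median(trial)   # median runtime, more sensitive to system noise
```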
Would be interesting to see the results for a larger mesh: as you say, there are only 64 cells in each color, which gives only 4 cells per thread (for 16). I would try 30x30x30, and perhaps just use the regular …
Hi, the chunk-loop approach sounds interesting and worth considering. Since my use case involves some heavy lifting in each element, a small overhead in the design of the threaded assembly is manageable. I just wanted to implement threaded assembly according to Julia design standards, inspired by this issue of yours. Now considering the threaded assembly example: for a new user, any form of threading is way better than none, especially if it is simple. Channels get rid of the issues mentioned in the blog post you shared, without introducing a new dependency. Many greetings
I find this pattern pretty easy to understand too: Ferrite.jl/docs/src/literate/threaded_assembly.jl, lines 127 to 142 at 0e3f4d2 (sketched below).
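The embedded snippet does not render here; roughly, that pattern spawns a fixed number of tasks per color, each with its own scratch buffer and a contiguous chunk of cells. A sketch using the same placeholder names as above, not the exact file contents:

```julia
using Base.Threads

# Chunk-per-task variant: a handful of tasks per color, each creates its own
# buffer once and then loops over its slice of the cells.
function assemble_color_chunked!(assemble_cell!, make_scratch, cells_in_color;
                                 ntasks = nthreads())
    chunk_size = max(1, cld(length(cells_in_color), ntasks))
    @sync for chunk in Iterators.partition(cells_in_color, chunk_size)
        @spawn begin
            scratch = make_scratch()   # task-local buffer, no channel needed
            for cell in chunk
                assemble_cell!(scratch, cell)
            end
        end
    end
    return nothing
end
```

Compared to the one-task-per-cell channel version this spawns far fewer tasks, at the cost of a coarser load balance within each color.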
This is the same benchmark as before, but using the minimum runtime instead of the median. This looks much nicer, I have to say.
Here comes the requested benchmark, this time with the single-measurement statistics. Looks pretty awful.
However, the degree of awfulness in runtime is a moot point, since both versions (using :static and using channels, respectively) behave similarly. My proposal for threaded assembly is thus finalized. Decide however you want; I simply felt like reporting a solution, since this thread made me think about my own implementation. Thanks to everyone partaking in the discussion.
Another thing that came to mind was whether it might be beneficial to use atomics instead of the coloring.
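To spell out what that could mean, here is a rough sketch of accumulating into a global vector with atomic adds instead of relying on coloring. It assumes Atomix.jl's `@atomic` on array elements, and `cell_contributions` is a made-up element routine:

```julia
using Atomix, Base.Threads

# Atomic accumulation: cells may share dofs, but each += is race-free,
# so no coloring of the grid is required.
function accumulate_atomic!(f::Vector{Float64}, cells)
    @threads for cell in cells
        for (dof, val) in cell_contributions(cell)  # made-up element routine
            Atomix.@atomic f[dof] += val
        end
    end
    return f
end
```

Whether this beats coloring presumably depends on how much contention there is on shared dofs.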
The threaded assembly in our example is not the recommended way to do things anymore (it is still correct though, AFAIU).
See PSA: Thread-local state is no longer recommended
One option could be to use ChunkSplitters.jl.
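A minimal sketch of what that could look like, assuming `chunks(x; n = ...)` yields index ranges (the exact API has shifted between ChunkSplitters versions) and with a simple reduction standing in for the assembly work:

```julia
using ChunkSplitters, Base.Threads

# One task per chunk, each with its own accumulator; nothing is keyed on threadid().
function threaded_sum_of_squares(data; ntasks = nthreads())
    tasks = map(chunks(data; n = ntasks)) do idxs
        @spawn begin
            acc = 0.0
            for i in idxs
                acc += data[i]^2
            end
            acc
        end
    end
    return sum(fetch, tasks)
end
```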