
Atomic operations on array elements #32455

Closed
chengchingwen opened this issue Jun 30, 2019 · 24 comments · Fixed by #37847
Labels
multithreading Base.Threads and related functionality

Comments

@chengchingwen
Member

The atomic methods for Atomic call pointer_from_objref and then perform the operation with llvmcall. If we exposed the pointer API, we could do atomic operations on arrays more easily. Is there any reason to dispatch the atomic operations only on the Atomic type?

@JeffBezanson
Member

Ah, I see, this differs from #29943. That is asking for atomic operations on pointers, while this is about doing atomic operations via pointers to other structures like arrays. I will try to clarify the title.

@JeffBezanson JeffBezanson changed the title Expose the atomic methods for pointer? atomic operations on array elements and object fields Dec 16, 2019
@oschulz
Contributor

oschulz commented Jun 13, 2020

This recently came up here: https://discourse.julialang.org/t/parallelizing-a-histogram/41260/9

Would be great to have atomic_add!(A::Array, x, idxs), etc.

@tkf
Member

tkf commented Jun 15, 2020

Quoting my comment in the discourse thread:

OK, so I uploaded the code I used for trying out atomic increments on array elements: https://github.com/tkf/ParallelIncrements.jl

(Comment added: I just realized that the name of the package is completely misleading since it was about single thread performance. Initially, I was thinking about trying it with multiple threads.)

This package contains a function ref = atomicref(::Array, indices...) which can be used for atomic operations like Threads.atomic_add!(ref, x).

As you can see in the README, the overhead is 10x on my laptop. It can go up to 100x and possibly more depending on the kind of optimization the compiler can do. It might be possible to get a lower overhead using other memory orderings but I haven't tried it yet.

I'm not a big fan of this direction, but I'm interested in whatever is beneficial for HPC in Julia. So, I'm sharing it just in case some people want to try it out (and maybe find some performance bugs I made in the sample code or something).
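
For illustration, a minimal usage sketch, assuming the atomicref interface works exactly as described above (hypothetical example, not taken from the package README):

using ParallelIncrements: atomicref
using Base.Threads

counts = zeros(Int, 16)
Threads.@threads for n in 1:10_000
    i = mod1(n, length(counts))
    ref = atomicref(counts, i)      # atomic handle to counts[i]
    Threads.atomic_add!(ref, 1)     # same call as for a Threads.Atomic
end
@assert sum(counts) == 10_000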

@oschulz
Contributor

oschulz commented Jun 15, 2020

Came across this - that's on a GPU, though: https://devblogs.nvidia.com/gpu-pro-tip-fast-histograms-using-shared-atomics-maxwell/

@oschulz
Contributor

oschulz commented Jun 15, 2020

A slowdown of 10x may actually be quite acceptable in some cases, compared to the alternative.

Obviously, when you deal with things like small histograms (histograms are just a nice example use case for this), it's most efficient for every thread to histogram separately in 1st-level cache, and then merge afterwards. But if the histogram is large (multi-dimensional, many bins), atomics with a 10x overhead (compared to non-atomic writes) may still be faster than maintaining one histogram per thread, especially if the number of threads is large. Just a hunch, of course. I guess it'll come down to memory bandwidths, latencies and caching behavior in the individual use case ... hard to predict.

@tfk, do you think the atomicref comes with some additional overhead, compared to having an atomic_add! that directly operates on an array and index, or will the compiler optimize that away?

@tkf
Member

tkf commented Jun 15, 2020

I agree there might be some situations where this can be beneficial. I uploaded the code because I thought it'd be great if people could explore the use of it.

But, let me also note that my benchmark is about single-thread performance. If you are using multiple threads, atomics can hurt even more because they have to invalidate the cache of all CPUs. So, I'd guess 10x is more like a lower bound on the overhead.

atomicref comes with some additional overhead, compared to having an atomic_add that directly operates on array and index

I didn't see anything suspicious about it when I quickly looked at the LLVM IR.

(BTW, you might want to use @tkf to ping me, to avoid pinging another user.)

@oschulz
Contributor

oschulz commented Jun 15, 2020

If you are using multiple threads, atomic can hurt even more because it has to invalidate the cache of all CPUs.

Yes, I've been thinking about that - however, if the access pattern is scattered access to a fairly large amount of memory, the threads would, for the most part, not share cache lines, right?

@tkf
Member

tkf commented Jun 15, 2020

Ah, yes, you might be right.

@vtjnash
Member

vtjnash commented Jun 15, 2020

For a histogram-type operation, I'd expect a (histogram-shaped) distribution of cache conflicts, resulting in a large slowdown for each additional thread you add—starting at 10x slower just for going from single-threaded to 1-thread-but-thread-safe-with-atomics. Unlike @tkf, I have no numbers though, and I'm just somewhat parroting his comment.

This is actually relevant, because I'm currently thinking about not (yet) having atomics for individual array elements (just too likely to be slow and/or incorrect), so that you should perhaps instead just use a lock over the whole array. https://docs.google.com/document/d/e/2PACX-1vT7Ibthj9WyM8s5bcQbiKsVK6MtvzqmnPFMy-bcjZLlbqv55a0_sTJ99AkbvPIZk3t7MbhZ57NzaIzC/pub
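
To make the lock-over-the-whole-array alternative concrete, here is a minimal sketch (my own illustration with made-up names, not code from the proposal document):

using Base.Threads

function fill_hist_locked!(hist::Array{Int,3}, nsamples::Int)
    lk = Threads.SpinLock()              # one lock guarding the whole array
    Threads.@threads for s in 1:nsamples
        i, j, k = rand(1:size(hist, 1)), rand(1:size(hist, 2)), rand(1:size(hist, 3))
        lock(lk) do
            hist[i, j, k] += 1           # plain, non-atomic store, made safe by the lock
        end
    end
    return hist
end

fill_hist_locked!(zeros(Int, 201, 201, 201), 10^6)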

@vtjnash
Member

vtjnash commented Jun 15, 2020

(Note: the idealized cost of a lock is on the order of a couple of atomic operations, since that's about all it takes to implement one; it's important to realize that the cost of a lock and the cost of an atomic are therefore nearly the same.)

@oschulz
Contributor

oschulz commented Jun 15, 2020

Well, you kinda dash my hopes for efficient scattered atomic adds (maybe I can try to do some parallel benchmarking to get us numbers, though). :-)

But I'm very excited about the ideas in your manifest!

@tkf
Member

tkf commented Jun 15, 2020

I wonder what's the best strategy for implementing a histogram-like function with a very large output array using threads. Maybe (1) create batches of local index lists (for the entries to be incremented) in each thread, grouped by sub-regions of the output array, and then (2) send them to "writer" tasks that write them to their sub-region of the output array? This way, I guess we can distribute the contention a bit by dividing it by the number of sub-regions (the contention in this strategy would come from the queues for communicating with each writer task).
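
A rough sketch of that two-step idea (my own illustration, assuming the bin index of every sample has already been computed; none of these names come from the thread):

using Base.Threads

function writer_task_histogram(binidxs::Vector{Int}, nbins::Int; nregions::Int = Threads.nthreads())
    hist = zeros(Int, nbins)
    region(i) = cld(i * nregions, nbins)                 # sub-region that owns bin i
    channels = [Channel{Vector{Int}}(32) for _ in 1:nregions]
    # One writer task per sub-region: only it ever touches its bins, so no atomics are needed.
    writers = map(1:nregions) do r
        Threads.@spawn for batch in channels[r], i in batch
            hist[i] += 1
        end
    end
    @sync for chunk in Iterators.partition(binidxs, cld(length(binidxs), Threads.nthreads()))
        Threads.@spawn begin
            batches = [Int[] for _ in 1:nregions]        # group this chunk's indices by sub-region
            for i in chunk
                push!(batches[region(i)], i)
            end
            for r in 1:nregions
                isempty(batches[r]) || put!(channels[r], batches[r])
            end
        end
    end
    foreach(close, channels)                             # lets the writer loops terminate
    foreach(wait, writers)
    return hist
end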

@oschulz
Contributor

oschulz commented Jun 15, 2020

EDIT: I messed something up here, see below for corrections. Using a global lock is not fast, for this use case.

@vtjnash, you were right, of course: Using a global lock is indeed fastest. Here's a benchmark I ran
(using Julia v1.5.0-beta1.0):

https://gist.github.com/oschulz/dce9d2f2104deb2ff42edaa814bb3790

I'm filling a histogram with 201 x 201 x 201 bins with 10^6 (randn(rng), randn(rng), randn(rng)) variates, using a separate Random123.Philox4x RNG for each thread.

I defined

Base.@propagate_inbounds function atomic_addindex!(A::Array{Int64}, v::U, I...) where {U}
    T = Int64
    i = LinearIndices(size(A))[I...]   # linear index of the addressed element
    v_conv = convert(T, v)
    ptr = pointer(A, i)                # raw pointer to A[i]
    # atomicrmw add on the element's address, with acquire-release ordering
    Base.Threads.llvmcall("%ptr = inttoptr i64 %0 to i64*\n%rv = atomicrmw add i64* %ptr, i64 %1 acq_rel\nret i64 %rv\n", T, Tuple{Ptr{T},T}, ptr, v_conv)::T
    A
end

I hope that's correct.
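
(For context, a hypothetical call site for it in a threaded fill loop - just an illustration, not the actual benchmark code from the Gist:)

hist = zeros(Int64, 201, 201, 201)
Threads.@threads for n in 1:10^6
    # stand-in for binning the (randn, randn, randn) variates
    i, j, k = rand(1:201), rand(1:201), rand(1:201)
    atomic_addindex!(hist, 1, i, j, k)
end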

With JULIA_NUM_THREADS=64 (Epyc-2 CPU, 64 physical cores, single socket), I get:

  • No multi-threading: 205.771 ms
  • Using atomic_addindex!: 519.649 ms
  • Using a global SpinLock(): 4.784 ms

So with the global lock, I get a 43x speedup on 64 threads - I wouldn't have expected it to scale that well.

The random number generation itself is about 20 ms using one thread and 630 μs using all 64 threads, so it's only a relatively small fraction of the total time.

When I run with JULIA_NUM_THREADS=1:

  • No multi-threading: 203.243 ms
  • Using atomic_addindex!: 224.736 ms
  • Using a global SpinLock(): 195.301 ms

So if there's a bit of work to do (computing the random numbers and hist-bin index), the overhead of atomic_addindex! has little impact, single threaded. But with many threads, even on a large array, atomic access seems to cause way too much cache invalidation/contention.

Yay for global locks! :-)

EDIT: I messed something up here, see below for corrections. Using a global lock is not fast, for this use case.

@biochem-fan

@oschulz I looked at your code on the Gist.

Your fill_randn_mt_atomic! actually uses a global SpinLock, while fill_randn_mt_lock! uses atomic_push!. Is this intentional?

Where is your atomic_addindex! used? Shouldn't your atomic_push! call it?

@biochem-fan

@oschulz I confirmed that without using atomic_addindex!, the result is not correct.

I added sanity checks https://gist.github.com/biochem-fan/19072418e90dd783227a347efe892860/revisions and repaired function names.

The results with 24 threads on Xeon E5-2643 v2 are:

  • No multi-threading: 255.538 ms
  • Atomic add: 38.816 ms
  • Global spin lock: 578.194 ms

So, atomic add is the fastest.

@oschulz
Contributor

oschulz commented Apr 8, 2021

Hi @biochem-fan, thanks for catching this. No idea what I did back there; looks like I definitely mixed the function names up, and my atomic_addindex! isn't even used in the old version of my Gist. Maybe I accidentally used and uploaded a debugging state of the code or so - must have been late ... sorry!

In any case, I just updated my Gist (https://gist.github.com/oschulz/dce9d2f2104deb2ff42edaa814bb3790) to use atomic_addindex! (similar to your change) and to use locking only during the actual memory manipulation (for locked_push!).

Here are some benchmarks on a 128-core machine (2x AMD EPYC 7662 64-core). The locking approach clearly doesn't scale, while atomic_addindex! does scale, though not linearly.

(We also have to take reduced CPU clock boost under heavy multi-threading into account; the CPU ran at 3.3 GHz during fill_randn_mt_atomic! with 1 thread, but only at 2.4 GHz average with 64 threads.)

Using 1 thread (mean time):

  • gen_randn_st: 12.353 ms
  • gen_randn_mt: 12.271 ms
  • fill_randn_st!: 122.237 ms
  • fill_randn_mt_atomic!: 124.936 ms
  • fill_randn_mt_lock!: 147.139 ms

Using 2 threads (mean time):

  • gen_randn_st: 12.383 ms
  • gen_randn_mt: 6.404 ms
  • fill_randn_st!: 123.549 ms
  • fill_randn_mt_atomic!: 67.433 ms
  • fill_randn_mt_lock!: 145.086 ms

Using 4 threads (mean time):

  • gen_randn_st: 12.395 ms
  • gen_randn_mt: 4.330 ms
  • fill_randn_st!: 122.585 ms
  • fill_randn_mt_atomic!: 35.574 ms
  • fill_randn_mt_lock!: 161.797 ms

Using 8 threads (mean time):

  • gen_randn_st: 12.326 ms
  • gen_randn_mt: 3.102 ms
  • fill_randn_st!: 122.782 ms
  • fill_randn_mt_atomic!: 23.418 ms
  • fill_randn_mt_lock!: 300.381 ms

Using 16 threads (mean time):

  • gen_randn_st: 12.154 ms
  • gen_randn_mt: 1.648 ms
  • fill_randn_st!: 123.109 ms
  • fill_randn_mt_atomic!: 15.220 ms
  • fill_randn_mt_lock!: 424.703 ms

Using 32 threads (mean time):

  • gen_randn_st: 12.397 ms
  • gen_randn_mt: 911.524 μs
  • fill_randn_st!: 122.468 ms
  • fill_randn_mt_atomic!: 8.906 ms
  • fill_randn_mt_lock!: 455.146 ms

Using 64 threads (mean time):

  • gen_randn_st: 12.153 ms
  • gen_randn_mt: 647.767 μs
  • fill_randn_st!: 121.362 ms
  • fill_randn_mt_atomic!: 5.512 ms
  • fill_randn_mt_lock!: 503.848 ms

Using 128 threads (mean time):

  • gen_randn_st: 12.120 ms
  • gen_randn_mt: 639.225 μs
  • fill_randn_st!: 149.159 ms
  • fill_randn_mt_atomic!: 5.153 ms
  • fill_randn_mt_lock!: 844.348 ms

I don't understand the results for 128 threads, though - even though things are spread over two NUMA domains now, it should be faster than with 64 threads; at least the pure RNG generation should scale well. I did use thread pinning, so each thread should have a separate physical core (no hyper-threading). Can Julia somehow only schedule 64 threads even if started with 128?

@biochem-fan
Copy link

@oschulz Thank you very much for confirmation and more test results.

Indeed your result for 128 threads is interesting. Unfortunately I don't have access to machines with so many cores and I cannot try it.

These new results demonstrate that atomic_addindex! is useful at least in some cases. I am new to the Julia developer community. How should we move forward? PR #37683 seems stale.

Until atomic_addindex! gets incorporated to Julia, may I use your implementation in my projects?

@oschulz
Contributor

oschulz commented Apr 8, 2021

Until atomic_addindex! gets incorporated to Julia, may I use your implementation in my projects?

Of course!

I agree that having something like atomic_addindex! in Julia would be beneficial. I think the Julia dev team has some long-term plans regarding parallel access and memory mutability in general, but I don't know any details. But having something simple like atomic_addindex! IMHO couldn't hurt.

Alternatively, this could go into a little package. I'm not sure how fragile this would be, with the LLVM intrinsics, regarding newer Julia (and therefore newer LLVM versions). The advantage would be that other packages could use it without depending on a newer Julia version.

@oschulz
Contributor

oschulz commented Apr 8, 2021

Maybe a little package (e.g. "AtomicArrays.jl") would be best for now.

@jpsamaroo
Member

Does #37847 actually add atomic array accesses? I don't see any tests for this functionality, although I could just be overlooking them.

@vtjnash
Member

vtjnash commented Jun 3, 2021

No, I eventually noticed that there aren't often algorithms that would benefit from it. Though internally we do use structures that benefit from something similar: the access pattern used by IdDict (and others) already does. However, I'm already proposing to make that particular access pattern (concurrent-reader-with-writer, single-writer-with-lock) safe by default, such that it wouldn't require anything different. And since that can now be simulated with an atomic-release fence, it isn't strictly necessary to add either.

@tkf
Member

tkf commented Jun 18, 2021

I don't think this issue is fully addressed. Let me reopen this issue since I believe it's reasonable to keep tracking it (but, of course, I'm open to being convinced otherwise).

I do agree that supporting atomic operations on array elements can be misleading for people who are not familiar with parallel programming. The best approach is to create per-task mutable states and merge them wisely. I'm very serious about this, to the degree that I built a GitHub org JuliaFolds and a series of posts (https://juliafolds.github.io/data-parallelism/) around this principle.

Having said that, I believe that atomic operations on arrays are an important ingredient for trickier and more interesting parallel and concurrent programs. I think addressing this issue would be important for Julia to stay relevant in high-performance technical computing.

there aren't often algorithms that would benefit from it

I don't believe this is the case. There are plenty of practical nonblocking algorithms that require, e.g., CAS on an entry of an array. For example:

These data structures would be relevant in high-performance programming where the access pattern is sparse (= very little contention) and/or the resulting object is relatively large compared to available memory.

An extreme example is the GPU, where there are so many threads that even moderately sized objects (by CPU standards) cannot be allocated on a per-thread basis. This necessitates the use of atomics when computing, e.g., a histogram. Indeed, GPU Pro Tip: Fast Histograms Using Shared Atomics on Maxwell | NVIDIA Developer Blog does not even have a "non-atomic" example and explains why adding atomics at one more level of the memory hierarchy was beneficial.
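
To make that concrete, here is a small CUDA.jl sketch of my own (not code from the blog post; kernel and variable names are made up), assuming CUDA.jl's @atomic macro for device-side atomic updates:

using CUDA

function hist_kernel!(hist, binidxs)
    t = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if t <= length(binidxs)
        CUDA.@atomic hist[binidxs[t]] += Int32(1)   # atomic add on a device-array element
    end
    return nothing
end

binidxs = CuArray(rand(Int32(1):Int32(256), 2^20))   # pre-binned samples
hist = CUDA.zeros(Int32, 256)
@cuda threads=256 blocks=cld(length(binidxs), 256) hist_kernel!(hist, binidxs)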

Edit: I forgot to mention that when it comes to asynchronous updates on a large data structure, Chaotic Relaxation (Chazan and Miranker, 1969) is also an interesting example. There are also relatively recent examples of similar concepts in stochastic gradient descent (Recht et al., 2011) and parallel graph analytics (Lenharth, Nguyen, and Pingali, 2016).

@tkf tkf reopened this Jun 18, 2021
@tkf
Member

tkf commented Nov 13, 2021

FWIW, just as a stopgap measure until we get this into Julia proper, there's https://github.com/JuliaConcurrent/UnsafeAtomics.jl. There's also a wrapper https://github.com/JuliaConcurrent/AtomicArrays.jl but it's not yet registered.

@tkf tkf changed the title atomic operations on array elements and object fields Atomic operations on array elements Nov 13, 2021
@vtjnash vtjnash closed this as completed Oct 31, 2024
@oscardssmith
Member

Closed by AtomicMemory on 1.11/1.12
