Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add some support for atomic operations on mutable fields and Ptrs #37847

Merged
merged 1 commit into from
Jun 2, 2021

Conversation

vtjnash
Copy link
Member

@vtjnash vtjnash commented Oct 1, 2020

This starts much of the work of implementing Julia Atomics Manifesto for fields. In particular, it adds:

  • the @atomic macro
  • the ability to specify a field of a mutable object as atomic
  • the ability to specify memory orderings to getfield, setfield!, and isdefined (with error checking and codegen optimization)
  • the runtime Intrinsics for doing our standard set of atomic operations on Ptr, plus fences
  • the generic function wrappers for all operations (e.g. atomic_getproperty)
  • doc strings for new functions
  • support for atomics (pointers and fields) of up to 8 bytes in size that are guaranteed to be compatible with C (on some processors, these will imply locks managed by libc), and unlimited sizes of atomics for fields (which will imply the creation of a lock in the object—currently at the start)
  • most @atomic macro special forms and field declarations
  • swapfield!, mutatefield!, and replacefield! functions
  • renamed cmpswap to replace

It does not yet add (which can be future work for a later PR):

  • optimizations (codegen) for the new Ptr Intrinsics
  • optimizations for the swapfield, mutatefield!, and replacefield! functions
  • the lowering support for @atomic specifiers on variables (closure or global)
  • updates to Threads manual (which currently is just the same as the above link)
  • does not safely handle GC during field loads while locked (deadlock hazard) (fixed)
  • model modifyfield!_tfunc better (with op call)

closes #29943, closes #32455, closes #37683

base/Base.jl Outdated Show resolved Hide resolved
base/expr.jl Outdated Show resolved Hide resolved
src/julia-syntax.scm Outdated Show resolved Hide resolved
src/intrinsics.cpp Outdated Show resolved Hide resolved
@JeffBezanson
Copy link
Member

Looks great! It's a lot of stuff, but really pretty straightforward.

src/builtins.c Outdated Show resolved Hide resolved
Keno added a commit that referenced this pull request Nov 24, 2020
On x86 loads have implicit acquire semantics, but the same is
not true on other architectures (e.g. aarch64). We've been
observing test failures on aarch64 as a result (#38555).
Currently, we don't really have a good way to modify fields
in an atomic way (pending #37847), so this just achieves
the same thing using llvmcall to try and get the tests passing.
Keno added a commit that referenced this pull request Nov 25, 2020
On x86 loads have implicit acquire semantics, but the same is
not true on other architectures (e.g. aarch64). We've been
observing test failures on aarch64 as a result (#38555).
Currently, we don't really have a good way to modify fields
in an atomic way (pending #37847), so this just achieves
the same thing using llvmcall to try and get the tests passing.
@vtjnash vtjnash force-pushed the jn/intrinsic-atomics branch 2 times, most recently from 258f06f to 8372038 Compare March 13, 2021 04:39
@vtjnash vtjnash force-pushed the jn/intrinsic-atomics branch from 8372038 to 8242c8d Compare March 27, 2021 14:36
base/docs/basedocs.jl Outdated Show resolved Hide resolved
base/docs/basedocs.jl Outdated Show resolved Hide resolved
src/ast.c Outdated Show resolved Hide resolved
@vtjnash vtjnash force-pushed the jn/intrinsic-atomics branch 3 times, most recently from f56ddb3 to 2b70189 Compare April 3, 2021 03:48
@vtjnash vtjnash force-pushed the jn/intrinsic-atomics branch from 2b70189 to 4c60a8d Compare April 21, 2021 18:38
Copy link
Member

@tkf tkf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for keeping track of the discussion in the Manifesto document. I just went through the docs and have a look at the code a bit.

Misc questions

Release burrier for each constructor

Motivation section of Julia Atomics Manifesto mentions that

This is particularly relevant for ensuring the type-tag of the object is always valid and that the fields of immutable objects (or simply constant fields) always are seen to have their correct value. This implies a release barrier must exist between the end of the constructor and the first time the object is stored to a shared location from a given thread.

Isn't it sufficient to use sequentially consistent ordering for the task enqueue/dequeue? Since any non-atomic data transfer between the tasks are done via the closure stored in the Task which in turn is transferred through the task queue, sequentially consistent store in the task queue seems to be sufficient to guarantee the visibility of the memory operations prior to spawn. It sounds like a much cheaper (presumably free) strategy than requiring a release fence after every constructor which can get in the way of the compiler.

Weak atomics and boxed fields

TODO: however, that still may leave open the question of whether @atomic :relaxed a.x += 1 should get upgraded to include a release barrier if we decide not to allocate x inline.

I was actually writing a comment on exactly this point (although I suppose you need acquire on the read side as well). But I wonder if it should be considered an undefined behavior and implement it as a compile-time error since it likely indicates a misuse of the API. Declaring that this is an undefined behavior leaves us an opportunity to implement auto-promotion afterwards.

(Side note: If we document that this is detected as an error, adding auto-promotion is a breaking change. Thus, we need to not document it or document that this is an undefined behavior if we want to define it later.)

Atomic fields can be read without specifying an ordering (or specifying ":not_atomic"), which will automatically upgrade it to a ":relaxed" read. This is useful for generic code (like show_any).

For the same reason, this is not enough when the field is boxed since the store of the pointer to a heap-allocated object can be reordered with respect to (potentially temporary invariance-breaking) manipulations "prior" to the store.

(But, I don't think we can make something like show_any race-free in general anyway since any thread can be breaking the internal invariance of a mutable object; e.g., adding non-zeros to a sparse array)

API usability for writing lock-free data structures

When implementing a lock-free hash table, I found a couple of atomics use cases that I'm not sure how to fit within the API proposed in this PR. Some more notes to the comments below are in JuliaConcurrent/ConcurrentCollections.jl#4

Multi-field atomics

There was a case where I needed to use atomic load on each field and CAS on two fields. This is for avoiding cmpxchg16b for just loading the fields. Schematically, I think it can be mapped to the API of this PR like this:

mutable struct SixteenBytes
    @atomic a::Int64
    @atomic b::Int64
end
x = SixteenBytes(0, 0)
a = @atomic x.a
b = @atomic x.b
cmpswapfield!(x, (:a, :b), ===, (a, b), (2, 3))  # hypothetical API use

Or is it better/required to declared at the field level (for alignment control)?

mutable struct SixteenBytes
    @union begin  # something like C/C++'s union (hypothetical API)
        @atomic ab::Tuple{Int64,Int64}
    end begin
        @atomic a::Int64
        @atomic b::Int64
    end
end
x = SixteenBytes(0, 0)
a = @atomic x.a
b = @atomic x.b
cmpswapfield!(x, :ab, ===, (a, b), (2, 3))

This API may not be needed for the first version. But I think it's worth start considering to make adding this later easier.

Defining order-generic user function

Some user function defined based on atomics operations can also be used with different or non-atomic orderings. For example, it would be useful in non-atomic batch construction of a hash table. To support such use cases, it would be useful to if the ordering flags were singletons, so that the users don't have to worry about constant propagation heuristics.

It is possible for users to define their own singletons that are manually mapped to the symbols at use sites. However, it does not look like Julia Atomics Manifesto's Alternatives Considered section explicitly considers this pattern and reject it. Although the simplicity argued by the Manifesto is appealing, defining singletons for each code base may introduce cognitive overhead.

Also, I think it's an example where the point mentioned in the Manifesto

Atomic-declared fields can only be written with atomic orderings (and vice versa).

can make it harder to write non-blocking data structures.

However, an unsafe adapter interface that wraps get/set fields can be enough for such use case. It may even be possible to prototype/experiment this outside Base if the representation of the @atomic fields is straightforward enough.

Atomic getindex/setindex!

Future work section of the Manifesto mentions that

However, there are additional complexities with this that make it seem rare a user would be using atomic operations on individual entries of an array, both due to the algorithmic complexity of such usage will likely prefer towards wrapper objects that define locks, and the rate of cache misses would be likely also to kill performance of such an algorithm.

I think adding atomic getindex/setindex! actually enables interesting programming opportunities. For one thing, a concurrent dictionary I implemented already required atomic getindex/setindex!. It would be useful for implementing other array-based lock-free data structures such as array-based queues.

base/Base.jl Outdated Show resolved Hide resolved
Comment on lines +507 to +533
l, r = esc(ex.args[1]), esc(ex.args[2])
return :(getproperty($l, $r, $order))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have a guarantee that @atomic call is lowered to a single atomic API call and nothing else? That is to say, disallow invocations such as

@atomic a.b.c
@atomic list.head += list.tail.tail.tail.head
@atomic a.x, b.x = b.x, a.x

...? The asymmetry in the lowering of the last expression is especially concerning to me. I think allowing composite expression can provide miss-impression and potentially promote misuse that assume that everything inside @atomic is done in a transactional manner.

Since allowing more expressions inside the macro is backward compatible (unless the detection of the error is the API guarantee), I think it'd be better to allow lowering to only one call inside @atomic at least as the first version of the API. Also, I think it'd be better to mechanically verify this invariance (= arguments are all symbols or constants).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose so, but I don't generally like to create artificial limits on what expressions people can write. There is just too many holes left behind by static limitations (like syntax) that I didn't consider it worthwhile here. For example, a tuple might be a constant too, or x+1, and so on. And I think that the useful examples are more numerous than that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree something like @atomic a.x += x + 1 and @atomic a.x = (0, 0) can be handy. But I see @atomic as an expert feature so I feel starting from a restrictive API makes sense. Since a line with @atomic will be a very delicate part of the code, making this expression as minimal as possible can promote the readability of the code.

Adding to my example abode, it's also not clear to me what @atomic a.x += 2 * a.x should do. It's conceivable (= I'm not suggesting it) to lower this to

op(x, _) = x + 2 * x
modifyproperty!(a, :x, op, nothing, ...)

which in turn is compiled to a CAS loop.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, that seems a bit too magic to scan for expressions. If you want it, you can write it with anonymous functions:

@atomic a.x = ((x, _) -> x + 2 * x)(a.x, nothing)

Otherwise, you just need to observe a rule that each atomic operation should appear only once in the original expression. Since a.x appeared twice, that expression will happen twice. Both atomically, but twice. (one sequentially_consistent, since it pairs with the store, and the other relaxed, since it doesn't have an ordering, either in the original expression nor with a simultaneous store)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, you just need to observe a rule that each atomic operation should appear only once in the original expression.

My point is that it sounds better to mechanically verify this at the macro expansion time.

Since a.x appeared twice, that expression will happen twice.

Yeah, but, since this is not the case for @atomic a.x, z = y, a.x, I think it's hard to remember and be sure when a.x will be pattern-matched and "fused"

@vtjnash
Copy link
Member Author

vtjnash commented May 11, 2021

It would be better to ask those questions either on the document itself or in the minutes, as Github quickly swallows discussion. However:

Since any non-atomic data transfer between the tasks

No, that is still an atomic data transfer, and probably not the most common way to transfer data.

much cheaper than

I used to have a PR for this, since it is basically free on Intel CPUs.

implement auto-promotion afterwards

Auto-promotion to a stronger ordering is always fine. It happens most of the time in all compilers, since often there isn't a hardware instruction for the actual behavior specified by the standard. It doesn't trigger UB or concerns of UB to do.

this is not enough when the field is boxed since the store of the pointer to a heap-allocated object can be reordered with respect to (potentially temporary invariance-breaking) manipulations "prior" to the store

Those are write-side issues, while here we're discussing read-side behaviors. Your intuition about missing invariants is spot-on, and is why this upgrades it merely to a relaxed read and not sequentially-consistent. I think this is mentioned? The write-side already is mandatory to specify the ordering, so the concerns you mention are already required to be handled. Sure, you can specify them incorrectly, but then that is a problem for all reads and other writes. And precisely how simple we make the write-side defaults is the topic discussed elsewhere, as you noted earlier about the motivation section.\

Multi-field atomics

This seems much too complicated, as this example isn't showing a multifield atomic, but an atomic partial read of a single field. A compiler is legally allowed to avoid reading bytes which aren't observable, so it should be demoting the atomic load+extract into a smaller atomic load automatically. If we see that this optimization isn't happening now, we can implement it later.

it does not look like Julia Atomics Manifesto's Alternatives Considered section explicitly considers this pattern and reject it

Did you mean to say it does consider this and reject it? This is the first bullet point under the first heading in that section.

Atomic-declared fields can only be written with atomic orderings

As far as I know, this requirement is also still currently mandatory for C++ and C also. There was some discussion of relaxing it, e.g. in http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3710.html, since it would permit some algorithms that are currently illegal to specify, but I don't know if any consensus was reached.

Atomic getindex/setindex

Yes, these may be sometimes useful; that is why the document specifies these are out of scope of current discussion, aside from noting that we don't anticipate anything here interfere with or complicating later development of them.

@vtjnash vtjnash force-pushed the jn/intrinsic-atomics branch from 4c60a8d to 1b489d1 Compare May 11, 2021 20:03
@vtjnash vtjnash marked this pull request as ready for review May 11, 2021 21:16
@tkf
Copy link
Member

tkf commented May 12, 2021

Personally, I find it difficult to find old comments in google docs. It also hard to write code, format comments, and include image. If I link back the comments from the google docs, would it solve the discoverability issue? Otherwise, if you prefer inline comment on the google docs anyway, I can move my comments there in the next time around.

much cheaper than

I used to have a PR for this, since it is basically free on Intel CPUs.

This is true for CPU but I believe not for the optimizer which also has to obey the C/C++ (≈ LLVM) memory model. That's why I said "which can get in the way of the compiler." Release fence can impedes the optimizer: https://godbolt.org/z/3n1a5zv46

It happens most of the time in all compilers, since often there isn't a hardware instruction for the actual behavior specified by the standard.

I agree that it's safe to compile to stronger operation in the code the user writes. However, I think promoting @atomic :relaxed a.x += 1 is doing something different. The CAS of the pointer at a.x has a hardware support and there is no need to promote the instruction just for executing the user code (at LLVM IR level) as-is. That is to say, we don't get a lock from this code https://godbolt.org/z/Me1G4cnbf

#include <atomic>

struct S {
    std::atomic<int*> p;
};

void f(S &s, int *p) {
    s.p.store(p, std::memory_order_relaxed);
}

IIUC, the auto-promotion you are describing that all compilers do is based solely on the size of the field (e.g., std::atomic<LargeStruct> a;). On the other hand, the promotion of @atomic :relaxed a.x += 1 discussed here requires to understand the representation of the type of the field.

It doesn't trigger UB or concerns of UB to do.

Just to be clear, I'm not talking about UB on LLVM side, if that's what you are referring to. I was trying to note that if we want to define a behavior of a certain condition X later, we need to declare that it's user's responsibility to avoid such condition X. That is to say, not X has to be (a part of) the precondition of the API. Relaxing the precondition is always valid.

Those are write-side issues, while here we're discussing read-side behaviors.

We can't discuss them separately, right? Atomic operation always concerns both read and write sides (except for relaxed ordering). In this case, don't you need at least acquire on read and release on write?

But if we want safe generic code by default, I don't think there is a good way to handle atomic fields containing pointers. I think it'd be better to ignore such fields in generic code in this case. I agree handling isbitsunion types with relaxed load is still data race-free.

This seems much too complicated, as this example isn't showing a multifield atomic, but an atomic partial read of a single field.

In the sample code, I tried to exemplify two single field ("partial") reads but also single two-field CAS. Two fields are loaded separately to avoid auto-promotion to cmpxchg16b.

Atomic-declared fields can only be written with atomic orderings

As far as I know, this requirement is also still currently mandatory for C++ and C

Doesn't atomic_ref of C++20 let you mix atomic and non-atomic operations on the same location? https://en.cppreference.com/w/cpp/atomic/atomic_ref

Also, Lahav et al. (2017) (which is the basis of C++20 atomics standard) has many examples of valid executions starting with non-atomic write (W^na), e.g.,

image

I think initialization with non-atomic operation is a pretty useful (and valid) optimization. I don't remember if they discuss forbidding non-atomic operations after atomic operations on the same location. It sounds like an unnecessary restriction, as far as the memory model is concerned (it could be nice as an API design though).

@vtjnash
Copy link
Member Author

vtjnash commented May 12, 2021

Release fence can impedes the optimizer

Yes, fences are entirely different beasts and really not at all comparable. Your example code is also missing the branches normally required at that load (null check) and store (write barrier), and thus not relevant to the code fragment that would be affected here. As you noted, this is a TODO item still for later discussion, and not yet finalized for design (and therefore not yet part of this PR).

@atomic :relaxed a.x += 1

If .x is ::Int, then x86 would already convert this to LOCK XADD, IIUC. That LOCK prefix is TSO (total-store-order), which is much stronger than required for "relaxed" (IIRC, it is between sequentially_consistent and acquire_release). Other CPUs won't be this strong. And if .x is ::Any, then it will already be LOCK CMPXCHG. So the question is generally if we should force other platforms and the compiler to match the behavior of x86. But currently I'm avoiding answering that for now, by avoiding promising anything.

don't you need at least acquire on read and release on write

Without a garbage collector, yes, but with a garbage collector, I don't believe that the problematic code pattern should be possible to write.

@vtjnash
Copy link
Member Author

vtjnash commented May 12, 2021

let you mix atomic and non-atomic operations on the same location

No, the page you linked notes that mixing is explicitly prohibited: "While any atomic_ref instances referencing an object exists, the object must be exclusively accessed through these atomic_ref instances"

I think initialization with non-atomic operation is a pretty useful (and valid) optimization

Yeah, I don't think I mentioned this optimization much, though it is implemented here. It should also be a valuable contribution (if not already implemented in LLVM) to weaken atomic declarations to non-atomic when the memory is known not to alias any memory accessible by another thread (for example, during initialization).

base/expr.jl Outdated Show resolved Hide resolved
@tkf
Copy link
Member

tkf commented May 12, 2021

Release fence can impedes the optimizer

Yes, fences are entirely different beasts and really not at all comparable.

My bad, your wording was release barrier. But, store release also impedes the optimizer anyway: https://godbolt.org/z/Masch6e9E

I guess this example is still technically misleading since, IIUC, the compiler is still allowed to move code into the critical sections and use the vectorized instruction if it really wants to. But it'd be strictly disabled if there is a sequentially consistent load of unrelated location in the loop body while such loads alone can still allow the optimization (with the same argument that moving code into the critical sections is OK).

don't you need at least acquire on read and release on write

Without a garbage collector, yes, but with a garbage collector, I don't believe that the problematic code pattern should be possible to write.

GC is great for avoiding problems like ABA but is it enough? For example

x = Dict()
merge!(x, pairs(1:64))  # internally temporary break invariance
@atomic a.x = x

seems to require release store as otherwise we can't be sure x is well-formed on the load side. I don't think GC can help ensuring user-defined constraints.

No, the page you linked notes that mixing is explicitly prohibited: "While any atomic_ref instances referencing an object exists, the object must be exclusively accessed through these atomic_ref instances"

I think the important point is "While any atomic_ref instances referencing an object exists." The documentation also says "For the lifetime of the atomic_ref object, the object it references is considered an atomic object." My reading is that this is allowed:

struct Counters { int a; int b; } counter;
non_atomic_use(counter);
{
    std::atomic_ref<Counters> cnt(counter);
    concurrent_atomic_use(cnt);
}
non_atomic_use(counter);

@vtjnash
Copy link
Member Author

vtjnash commented May 12, 2021

Here's a more realistic example, since the compiler would never emit your version (due to GC): https://godbolt.org/z/8jcefMdc8. Notice that some stores got moved around, but overall it is pretty much identical.

seems to require release store

Yes, it does require a release store (assuming we intended that to be visible to another thread), which is why I've proposed elsewhere to make that the default behavior, even if you are using it improperly. The example code you have there includes a sequentially_consistent store, which would be more than sufficient. This isn't the subject of this PR though, so these comments would be more appropriate on the design manifesto or on the PR that proposes it (#35535) to avoid veering off topic.

My reading is that this is allowed

Yes, re-using the memory location for other purposes is allowed in C/C++, as long as the accesses are not mixed (it is to be assumed concurrent_atomic_use does some sort of thread-join before return, and ensures no copies of cnt outlive that block)

@tkf
Copy link
Member

tkf commented May 14, 2021

Thanks for the example. The optimization I had in mind (sinking store) won't work if there is a nonlined function call like wb anyway. I moved the comments to hackmd.

@vtjnash vtjnash force-pushed the jn/intrinsic-atomics branch 3 times, most recently from 05f8ee1 to 1ebe4a7 Compare May 28, 2021 20:26
@vtjnash vtjnash force-pushed the jn/intrinsic-atomics branch 2 times, most recently from 02ac63b to e247459 Compare May 31, 2021 22:47
@vtjnash vtjnash added this to the 1.7 milestone Jun 1, 2021
@JeffBezanson JeffBezanson added the merge me PR is reviewed. Merge when all tests are passing label Jun 2, 2021
@vtjnash vtjnash force-pushed the jn/intrinsic-atomics branch from e247459 to f313b08 Compare June 2, 2021 19:06
@vtjnash vtjnash merged commit 9f32653 into master Jun 2, 2021
@vtjnash vtjnash deleted the jn/intrinsic-atomics branch June 2, 2021 22:25
@DilumAluthge DilumAluthge removed the merge me PR is reviewed. Merge when all tests are passing label Jun 18, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
multithreading Base.Threads and related functionality
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Atomic operations on array elements Support Threads.Atomic{Ptr}
4 participants