Add some support for atomic operations on mutable fields and Ptrs #37847
Conversation
Looks great! It's a lot of stuff, but really pretty straightforward.
On x86 loads have implicit acquire semantics, but the same is not true on other architectures (e.g. aarch64). We've been observing test failures on aarch64 as a result (#38555). Currently, we don't really have a good way to modify fields in an atomic way (pending #37847), so this just achieves the same thing using llvmcall to try and get the tests passing.
Thanks for keeping track of the discussion in the Manifesto document. I just went through the docs and had a look at the code a bit.
Misc questions
Release barrier for each constructor
The Motivation section of the Julia Atomics Manifesto mentions that
This is particularly relevant for ensuring the type-tag of the object is always valid and that the fields of immutable objects (or simply constant fields) always are seen to have their correct value. This implies a release barrier must exist between the end of the constructor and the first time the object is stored to a shared location from a given thread.
Isn't it sufficient to use sequentially consistent ordering for the task enqueue/dequeue? Since any non-atomic data transfer between tasks is done via the closure stored in the Task, which in turn is transferred through the task queue, a sequentially consistent store into the task queue seems sufficient to guarantee the visibility of memory operations prior to spawn. That sounds like a much cheaper (presumably free) strategy than requiring a release fence after every constructor, which can get in the way of the compiler.
Weak atomics and boxed fields
TODO: however, that still may leave open the question of whether
@atomic :relaxed a.x += 1
should get upgraded to include a release barrier if we decide not to allocate x inline.
I was actually writing a comment on exactly this point (although I suppose you need acquire on the read side as well). But I wonder if it should be considered undefined behavior and implemented as a compile-time error, since it likely indicates a misuse of the API. Declaring this to be undefined behavior leaves us an opportunity to implement auto-promotion afterwards.
(Side note: if we document that this is detected as an error, adding auto-promotion becomes a breaking change. Thus, we need to either not document it, or document it as undefined behavior, if we want to define it later.)
Atomic fields can be read without specifying an ordering (or specifying :not_atomic), which will automatically upgrade it to a :relaxed read. This is useful for generic code (like show_any).
For the same reason, this is not enough when the field is boxed, since the store of the pointer to a heap-allocated object can be reordered with respect to (potentially temporarily invariant-breaking) manipulations "prior" to the store.
(But I don't think we can make something like show_any race-free in general anyway, since any thread can be breaking the internal invariants of a mutable object; e.g., adding non-zeros to a sparse array.)
API usability for writing lock-free data structures
When implementing a lock-free hash table, I found a couple of atomics use cases that I'm not sure how to fit within the API proposed in this PR. Some more notes on the comments below are in JuliaConcurrent/ConcurrentCollections.jl#4
Multi-field atomics
There was a case where I needed to use an atomic load on each field and a CAS on two fields. This is for avoiding cmpxchg16b for just loading the fields. Schematically, I think it can be mapped to the API of this PR like this:
```julia
mutable struct SixteenBytes
    @atomic a::Int64
    @atomic b::Int64
end
x = SixteenBytes(0, 0)
a = @atomic x.a
b = @atomic x.b
cmpswapfield!(x, (:a, :b), ===, (a, b), (2, 3))  # hypothetical API use
```
Or is it better (or required) to declare it at the field level (for alignment control)?
```julia
mutable struct SixteenBytes
    @union begin  # something like C/C++'s union (hypothetical API)
        @atomic ab::Tuple{Int64,Int64}
    end begin
        @atomic a::Int64
        @atomic b::Int64
    end
end
x = SixteenBytes(0, 0)
a = @atomic x.a
b = @atomic x.b
cmpswapfield!(x, :ab, ===, (a, b), (2, 3))
```
This API may not be needed for the first version. But I think it's worth starting to consider it now, to make adding it later easier.
Defining order-generic user function
Some user functions defined on top of atomic operations can also be used with different or non-atomic orderings. For example, that would be useful for non-atomic batch construction of a hash table. To support such use cases, it would be useful if the ordering flags were singletons, so that users don't have to worry about constant-propagation heuristics.
It is possible for users to define their own singletons that are manually mapped to the symbols at use sites. However, it does not look like the Julia Atomics Manifesto's Alternatives Considered section explicitly considers this pattern and rejects it. Although the simplicity argued for by the Manifesto is appealing, defining singletons in each code base may introduce cognitive overhead.
Also, I think it's an example where the point mentioned in the Manifesto
Atomic-declared fields can only be written with atomic orderings (and vice versa).
can make it harder to write non-blocking data structures.
However, an unsafe adapter interface that wraps get/set field could be enough for such use cases. It may even be possible to prototype/experiment with this outside Base if the representation of @atomic fields is straightforward enough.
Atomic getindex/setindex!
The Future work section of the Manifesto mentions that
However, there are additional complexities with this that make it seem rare a user would be using atomic operations on individual entries of an array, both due to the algorithmic complexity of such usage will likely prefer towards wrapper objects that define locks, and the rate of cache misses would be likely also to kill performance of such an algorithm.
I think adding atomic getindex/setindex! actually enables interesting programming opportunities. For one thing, a concurrent dictionary I implemented already required atomic getindex/setindex!. It would be useful for implementing other array-based lock-free data structures such as array-based queues.
```julia
l, r = esc(ex.args[1]), esc(ex.args[2])
return :(getproperty($l, $r, $order))
```
Can we have a guarantee that an @atomic call is lowered to a single atomic API call and nothing else? That is to say, disallow invocations such as

```julia
@atomic a.b.c
@atomic list.head += list.tail.tail.tail.head
@atomic a.x, b.x = b.x, a.x
```

...? The asymmetry in the lowering of the last expression is especially concerning to me. I think allowing composite expressions can give a false impression and potentially promote misuse that assumes everything inside @atomic is done in a transactional manner.
Since allowing more expressions inside the macro is backward compatible (unless detecting the error is part of the API guarantee), I think it'd be better to allow lowering to only one call inside @atomic, at least as the first version of the API. Also, I think it'd be better to mechanically verify this invariant (i.e., that the arguments are all symbols or constants).
I suppose so, but I don't generally like to create artificial limits on what expressions people can write. There are just too many holes left behind by static limitations (like syntax) that I didn't consider it worthwhile here. For example, a tuple might be a constant too, or x + 1, and so on. And I think that the useful examples are more numerous than that.
I agree something like @atomic a.x += x + 1 and @atomic a.x = (0, 0) can be handy. But I see @atomic as an expert feature, so I feel starting from a restrictive API makes sense. Since a line with @atomic will be a very delicate part of the code, keeping the expression as minimal as possible can promote readability.
Adding to my example above, it's also not clear to me what @atomic a.x += 2 * a.x should do. It's conceivable (= I'm not suggesting it) to lower this to

```julia
op(x, _) = x + 2 * x
modifyproperty!(a, :x, op, nothing, ...)
```

which in turn is compiled to a CAS loop.
Yeah, that seems a bit too magic, to scan for expressions like that. If you want it, you can write it with an anonymous function:

```julia
@atomic a.x = ((x, _) -> x + 2 * x)(a.x, nothing)
```

Otherwise, you just need to observe a rule that each atomic operation should appear only once in the original expression. Since a.x appeared twice, that read will happen twice. Both atomically, but twice (one sequentially consistent, since it pairs with the store, and the other relaxed, since it doesn't have an ordering, either in the original expression or via a simultaneous store).
> Otherwise, you just need to observe a rule that each atomic operation should appear only once in the original expression.

My point is that it sounds better to mechanically verify this at macro-expansion time.

> Since a.x appeared twice, that expression will happen twice.

Yeah, but since this is not the case for @atomic a.x, z = y, a.x, I think it's hard to remember and be sure when a.x will be pattern-matched and "fused".
It would be better to ask those questions either on the document itself or in the minutes, as GitHub quickly swallows discussion. However:
No, that is still an atomic data transfer, and probably not the most common way to transfer data.
I used to have a PR for this, since it is basically free on Intel CPUs.
Auto-promotion to a stronger ordering is always fine. It happens most of the time in all compilers, since often there isn't a hardware instruction for the exact behavior specified by the standard. Doing so doesn't trigger UB or concerns about UB.
Those are write-side issues, while here we're discussing read-side behaviors. Your intuition about missing invariants is spot-on, and is why this upgrades it merely to a relaxed read and not a sequentially-consistent one. I think this is mentioned? The write side already is mandatory to specify the ordering, so the concerns you mention are already required to be handled. Sure, you can specify them incorrectly, but then that is a problem for all reads and other writes. And precisely how simple we make the write-side defaults is the topic discussed elsewhere, as you noted earlier about the motivation section.
This seems much too complicated, as this example isn't showing a multifield atomic, but an atomic partial read of a single field. A compiler is legally allowed to avoid reading bytes which aren't observable, so it should be demoting the atomic load+extract into a smaller atomic load automatically. If we see that this optimization isn't happening now, we can implement it later.
Did you mean to say it does consider this and reject it? This is the first bullet point under the first heading in that section.
As far as I know, this requirement is also still currently mandatory for C++ and C also. There was some discussion of relaxing it, e.g. in http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2013/n3710.html, since it would permit some algorithms that are currently illegal to specify, but I don't know if any consensus was reached.
Yes, these may sometimes be useful; that is why the document specifies they are out of scope of the current discussion, aside from noting that we don't anticipate anything here interfering with or complicating their later development.
Personally, I find it difficult to find old comments in Google Docs. It is also hard to write code, format comments, and include images. If I link back to the comments from the Google Doc, would that solve the discoverability issue? Otherwise, if you prefer inline comments on the Google Doc anyway, I can move my comments there next time.
This is true for the CPU, but I believe not for the optimizer, which also has to obey the C/C++ (≈ LLVM) memory model. That's why I said "which can get in the way of the compiler." A release fence can impede the optimizer: https://godbolt.org/z/3n1a5zv46
I agree that it's safe to compile to a stronger operation in the code the user writes. However, I think promoting

```cpp
#include <atomic>
struct S {
    std::atomic<int*> p;
};
void f(S &s, int *p) {
    s.p.store(p, std::memory_order_relaxed);
}
```
} IIUC, the auto-promotion you are describing that all compilers do is based solely on the size of the field (e.g.,
Just to be clear, I'm not talking about UB on the LLVM side, if that's what you are referring to. I was trying to note that if we want to define the behavior of a certain condition X later, we need to declare that it's the user's responsibility to avoid condition X. That is to say, not-X has to be (a part of) the precondition of the API. Relaxing a precondition is always valid.
We can't discuss them separately, right? Atomic operations always concern both the read and write sides (except for relaxed ordering). In this case, don't you need at least acquire on the read and release on the write? But if we want safe generic code by default, I don't think there is a good way to handle atomic fields containing pointers. I think it'd be better to ignore such fields in generic code in that case. I agree handling isbitsunion types with a relaxed load is still data-race-free.
In the sample code, I tried to exemplify two single-field ("partial") reads but also a single two-field CAS. The two fields are loaded separately to avoid auto-promotion to
Doesn't
Also, Lahav et al. (2017) (which is the basis of the C++20 atomics standard) has many examples of valid executions starting with a non-atomic write (W^na). I think initialization with non-atomic operations is a pretty useful (and valid) optimization. I don't remember if they discuss forbidding non-atomic operations after atomic operations on the same location. It sounds like an unnecessary restriction as far as the memory model is concerned (though it could be nice as an API design).
Yes, fences are entirely different beasts and really not at all comparable. Your example code is also missing the branches normally required at that load (null check) and store (write barrier), and thus not relevant to the code fragment that would be affected here. As you noted, this is a TODO item still for later discussion, and not yet finalized for design (and therefore not yet part of this PR).
If
Without a garbage collector, yes, but with a garbage collector, I don't believe that the problematic code pattern should be possible to write.
No, the page you linked notes that mixing is explicitly prohibited: "While any atomic_ref instances referencing an object exists, the object must be exclusively accessed through these atomic_ref instances"
Yeah, I don't think I mentioned this optimization much, though it is implemented here. It should also be a valuable contribution (if not already implemented in LLVM) to weaken atomic declarations to non-atomic when the memory is known not to alias any memory accessible by another thread (for example, during initialization).
My bad, your wording was "release barrier". But a store-release also impedes the optimizer anyway: https://godbolt.org/z/Masch6e9E I guess this example is still technically misleading since, IIUC, the compiler is still allowed to move code into the critical sections and use the vectorized instruction if it really wants to. But it would be strictly disallowed if there were a sequentially consistent load of an unrelated location in the loop body, while such loads alone can still allow the optimization (by the same argument that moving code into critical sections is OK).
GC is great for avoiding problems like ABA, but is it enough? For example,

```julia
x = Dict()
merge!(x, pairs(1:64))  # internally breaks invariants temporarily
@atomic a.x = x
```

seems to require a release store, as otherwise we can't be sure
I think the important point is "While any atomic_ref instances referencing an object exists". E.g.,

```cpp
struct Counters { int a; int b; } counter;

non_atomic_use(counter);
{
    std::atomic_ref<Counters> cnt(counter);
    concurrent_atomic_use(cnt);
}
non_atomic_use(counter);
```
Here's a more realistic example, since the compiler would never emit your version (due to GC): https://godbolt.org/z/8jcefMdc8. Notice that some stores got moved around, but overall it is pretty much identical.
Yes, it does require a release store (assuming we intended that to be visible to another thread), which is why I've proposed elsewhere to make that the default behavior, even if you are using it improperly. The example code you have there includes a sequentially_consistent store, which would be more than sufficient. This isn't the subject of this PR though, so these comments would be more appropriate on the design manifesto or on the PR that proposes it (#35535) to avoid veering off topic.
Yes, re-using the memory location for other purposes is allowed in C/C++, as long as the accesses are not mixed (it is to be assumed that concurrent_atomic_use does some sort of thread-join before returning, and ensures no copies of
Thanks for the example. The optimization I had in mind (sinking the store) won't work if there is a non-inlined function call like
This starts much of the work of implementing the Julia Atomics Manifesto for fields. In particular, it adds:

- the @atomic macro
- getfield, setfield!, and isdefined (with error checking and codegen optimization)
- (atomic_getproperty)
- @atomic macro special forms and field declarations
- swapfield!, mutatefield!, and replacefield! functions
- cmpswap renamed to replace

It does not yet add (which can be future work for a later PR):

- swapfield, mutatefield!, and replacefield! functions
- @atomic specifiers on variables (closure or global)
- updates to the Threads manual (which currently is just the same as the above link)
- does not safely handle GC during field loads while locked (deadlock hazard) (fixed)

closes #29943, closes #32455, closes #37683