Annotate Julia reference accesses for atomic release/consume #35535
Conversation
I don't think we should do this. This deviates from the model that LLVM is optimized for. Also, even though this ordering is OK for normal loads/stores, I believe it is not very well defined/specified for newer instructions (even SSE). The added compiler barrier is almost certainly going to kill vectorization.
This is a (slightly) weakened form of the Java memory model (since it's only applied to pointers, not values), which LLVM has been expected to handle. And the new barrier on the stores would never have vectorized anyway (because of the required gc barrier). This code shows a mildly unfortunate consequence: …

… which reveals a missed vectorization opportunity inside LLVM that this causes. But this specific code also feels extremely rare and rather useless.
That is true of our current naive emission. This should fix it, based on the x86-TSO (total store order) model. See the research at https://www.cl.cam.ac.uk/~pes20/weakmemory/index.html for x86, ARM, and PPC, which discusses why this is valid.
Also, the relevant page in the Intel manual is available at …
> This is a (slightly) weakened form of the Java memory model

Does Java automatically have memory barriers? I thought they only have that for volatile (atomic).

> And the new barrier on the stores would never have vectorized anyway (because of the required gc barrier).

OK, this is fair.

> That is true of our current naive emission. This should fix it, based on the x86-TSO (total store order) model. See the research at https://www.cl.cam.ac.uk/~pes20/weakmemory/index.html for x86, ARM, and PPC, which discusses why this is valid.

Actually, I looked again: it is the atomicity that is not specified, not the ordering. I assume this means that a read of `atomic_order_relaxed` can't be vectorized. Also, I don't think `memcpy` and `memmove` have that kind of guarantee either. (And I'm pretty sure some implementations use string operations, which aren't even ordered according to Intel.)

My main point is that almost all the code from LLVM to libc is written to use all the single-thread optimizations that don't introduce races. Requiring a stronger memory model than that will require handling/disabling all of the related optimizations/assumptions.
```diff
@@ -459,7 +459,8 @@ static Value *literal_pointer_val(jl_codectx_t &ctx, jl_value_t *p)
         return literal_static_pointer_val(p);
     Value *pgv = literal_pointer_val_slot(ctx, p);
     return tbaa_decorate(tbaa_const, maybe_mark_load_dereferenceable(
-            ctx.builder.CreateLoad(T_pjlvalue, pgv), false, jl_typeof(p)));
+            ctx.builder.CreateAlignedLoad(T_pjlvalue, pgv, sizeof(void*)),
```
Are these alignment additions related? I thought this was the default alignment anyway.
Atomic instructions are required to always specify alignment, so I was just uniformly examining everything already. LLVM also now generally prefers that we always specify it explicitly, rather than depending on TargetData ("code will be correct, but may be inefficient", https://llvm.org/docs/Frontend/PerformanceTips.html#when-to-specify-alignment).
```c++
Type *ptrty = PointerType::get(elty, ptr->getType()->getPointerAddressSpace());
if (ptr->getType() != ptrty)
    ptr = ctx.builder.CreateBitCast(ptr, ptrty);
if (idx_0based)
    ptr = ctx.builder.CreateInBoundsGEP(r->getType(), ptr, idx_0based);
Instruction *store = ctx.builder.CreateAlignedStore(r, ptr, isboxed || alignment ? alignment : julia_alignment(jltype));
if (!isboxed && CountTrackedPointers(elty).count)
    ctx.builder.CreateFence(AtomicOrdering::Release);
```
IIUC, I don't think the fence can replace an atomic here (also in more places in the C code). It "improves" the ordering but not the atomicity.
I think you're confusing it with sequential consistency: that cannot be adequately defined as a fence, since it needs to be a property of the load/store operation itself. The release fence is documented to generally be okay here, and in practice is always okay on hardware.
Well, what I said is that as long as the store is not atomic, LLVM is allowed to do whatever it wants, including making the store non-atomic.
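For concreteness, here is a minimal C++ sketch (not from the PR; the names are hypothetical) of the pattern under debate. Note that in portable C++ the publishing store must itself be atomic for the fence to pair with a reader, which is exactly the atomicity objection above: the PR's codegen emits a plain LLVM store plus a fence.

```cpp
#include <atomic>

struct Node { int payload; };

std::atomic<Node*> published{nullptr};

void writer(Node *n) {
    n->payload = 42;                                     // plain (non-atomic) field write
    std::atomic_thread_fence(std::memory_order_release); // orders the write above before...
    published.store(n, std::memory_order_relaxed);       // ...this later atomic publication
}
```

A reader that acquire-loads `published` (or relaxed-loads it and then issues an acquire fence) and sees `n` is then guaranteed to observe `payload == 42`.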
src/array.c
```diff
@@ -1219,6 +1221,7 @@ JL_DLLEXPORT void jl_array_ptr_copy(jl_array_t *dest, void **dest_p,
         }
     }
     memmove(dest_p, src_p, n * sizeof(void*));
+    jl_fence_release(); // to finish up, ensure contents of dest is now visible on other processors too
```
And the missing atomicity aside, I think the fence needs to go before all stores to `dest`.
True, I should also stop using `memmove` and switch to `llvm.memmove.element.unordered.atomic` to be correct. This fence just means you can use a different (relaxed) variable as a flag to indicate the operation is done.
There is also a fence added at the start, to ensure that the pointers consume-loaded in this loop will also be valid to dereference (consume-load) on another thread.
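As a hedged sketch of the flag pattern just described (this is not the PR's code; the `done` flag is hypothetical), in C++ terms:

```cpp
#include <atomic>
#include <cstring>

// dest/src hold n pointers; `done` is a hypothetical completion flag.
void publish_copy(void **dest, void **src, std::size_t n, std::atomic<int> &done) {
    std::memmove(dest, src, n * sizeof(void*));          // the non-atomic copy
    std::atomic_thread_fence(std::memory_order_release); // plays the role of jl_fence_release()
    done.store(1, std::memory_order_relaxed);            // relaxed flag: "operation is done"
}
// A reader that observes done == 1 and then issues an acquire fence (or
// acquire-loads the flag) may safely read the copied pointers in dest.
```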
yes, llvm calls this …
Yep, that's what this PR adds/changes 650 lines to deal with, as a start.
Unordered is the issue I have elsewhere, but that's not what I was talking about here. I mean the store must have a release ordering.
Right, I know; the point is that those would have to add an unacceptable amount of performance penalty. Of course, if you just look at the instructions it will seem fine, and I agree that if the compiler could do no optimization, that would be the case. The point is that these are meant to disable a lot of single-thread optimizations (due to the write barrier this will mostly affect reads, of course, though given the direction vector instructions are heading, I won't be surprised at all if the write barrier can be vectorized with some tweaks). Some of the changes here already give a clear example of such a penalty, and I believe there are more cases for architectures that are not considered important/tested enough, like AVX512 and SVE, both of which have scatter/gather. It seems to me that people working on wide instructions care little about using those for synchronization, and rightly so, and it'll be a similar story for the compiler as well. Unless there's a guarantee that all future instruction extensions and compiler optimizations can support this well, I don't think it is a good move to kill all of them in a user-visible way now.
@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@nanosoldier

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan
doc/src/manual/parallel-computing.md
```
Additionally, Julia automatically will ensure memory safety in the presence of
some types of data-races.

When publishing a boxed value to a global global, that value can be read on
```
Suggested change:

```diff
-When publishing a boxed value to a global global, that value can be read on
+When publishing a value to a global, that value can be read on
```
doc/src/manual/parallel-computing.md
```
When publishing a boxed value to a global global, that value can be read on
another thread without additional synchronization. However, do not assume that
observing one value is set transitively implies anything about other values,
```
Suggested change:

```diff
-observing one value is set transitively implies anything about other values,
+observing one value implies anything about other values,
```
src/toplevel.c
```diff
@@ -32,6 +32,7 @@ JL_DLLEXPORT int jl_lineno = 0; // need to update jl_critical_error if this is T
 JL_DLLEXPORT const char *jl_filename = "none"; // need to update jl_critical_error if this is TLS

 htable_t jl_current_modules;
+jl_mutex_t jl_modules_mutex;
```
Let's split this addition into a separate commit
This will only change the default behavior. The next step is still to add instructions that let the user specify the atomic model explicitly, allowing them to request the stronger seq_cst or weaker non-atomic models where required.
But then why should we change the default? I can see the argument either for having non-atomic (what everyone else does) or full global seq_cst as the default, since one is what everything is built on top of and has no effect on the majority of the code, whereas seq_cst could be the default when you want to use thread synchronization, since that's the only one an average person can intuitively reason about. This PR is in some weird place in between: you get neither fast nor safe code by default, the behavior cannot easily be reasoned about, and you get a performance hit everywhere. Put another way, there are two different types of code, …

All well-written programs, multithreaded or not, will have the majority of their code executing non-synchronization code (i.e. the first kind). If not, you should get rid of the threads and remove all the synchronization, since it's just wasting time. Because of that, the code most people need most of the time is going to have no atomics. If this were not the default, almost every single function anyone ever writes would need a default version and a single-thread version, and I don't think that's practical. Now, ideally, whatever the default is, one would want every function to support all possible memory orderings. Since that's clearly impractical (excess code generation, if nothing else), we should put the burden of missing functions on the use cases that need them less. Even without all functions having a strongly memory-ordered version, one can still usually write slower but correct code of the second kind using simpler synchronization primitives, which is the main thing that matters for that kind of code. OTOH, without all functions having a non-atomic version, it is impossible to write performant code of the first kind, which is the main goal of that kind.
Because it's not. In particular, the second most popular language in the world uses this model (https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/?utm_source=rss&utm_medium=rss&utm_campaign=language-rankings-1-20).

Because it's massively expensive everywhere, so mostly only Python does this.

It's the x86 model. It's better than what we do now, and not as understandable as seq_cst. But it also has no performance cost on x86. That's why I ran the benchmarks: to show this doesn't affect them.
Everyone, as in everyone that can decide after things have converged, and also where performance matters.

No, it's not, because no one is coding against the hardware; you are coding against the compiler's model. It is not the model you see when you write code for an x86 machine.

And that's exactly why I mentioned, before you even ran the benchmarks, that they won't make sense here. In particular, are you running the benchmarks on a computer that has high-performance scatter/gather (with predicates), and are you running benchmarks where those matter? You already showed that …

Also, I don't think it's OK to be this x86-centric. If something already works on x86 and could exist elsewhere, it is certainly OK to have it. However, it is wrong to have everyone emulate x86.
Post a benchmark? I would also like to see a test case that crashes or otherwise misbehaves (presumably on non-x86) without this change. Then we'll be better able to assess the trade-off here.
In general, this disables vectorization of loads of references, so it'll mainly show up on x86 when optimizations on such operations have a significant effect. Other than the NULL check, a load of a reference can only vectorize with scatter/gather, and unless the field can't be … As a manual version of that, as I already mentioned, no normal …

IIRC, it may also affect forwarding (two loads of the same field can't be merged into one), but I'm less sure about that, and it'll certainly be harder to come up with a standalone test case where it affects performance significantly (it may still affect other performance, but a case where multiple optimizations interact with each other is much harder to come up with).

When not limited to x86 or the current optimization limits: I have benchmarked before that the release store is much slower on arm/aarch64 than a normal store, easily slower than our write barrier, which is just a (hopefully predictable) branch. As mentioned above, I also don't see why our write barrier can't be vectorized using scatter/gather; it requires much better LLVM vectorizer integration and a vectorized root-queueing function for sure, but it should be doable, and it won't be anymore with this change. Generally speaking, again as already mentioned above, almost none of the new vectorization features are for synchronization, and adding this will disable all of those. AFAICT even AVX512 usage seems to be just getting started.

As for a counterexample, see https://en.cppreference.com/w/cpp/atomic/memory_order#Release-Consume_ordering for the C++ example. It'll be very hard to produce a wrong result, especially on x86, since the naive codegen will easily give a stronger guarantee. You need to convince the optimizer to reorder instructions to produce a program that misbehaves on x86, which can easily give the illusion that this is a small change from what we already have. It should be much easier to do on arm/ppc (as well as seeing the performance hit).
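For reference, this is essentially the release/consume example from the cppreference page linked above, reproduced here with light annotation:

```cpp
#include <atomic>
#include <cassert>
#include <string>
#include <thread>

std::atomic<std::string*> ptr;
int data;

void producer() {
    std::string *p = new std::string("Hello");
    data = 42;
    ptr.store(p, std::memory_order_release);
}

void consumer() {
    std::string *p2;
    while (!(p2 = ptr.load(std::memory_order_consume)))
        ;                    // spin until the pointer is published
    assert(*p2 == "Hello");  // never fires: *p2 carries a dependency from ptr
    assert(data == 42);      // may or may not fire: data carries no dependency from ptr
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
}
```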
Can you provide numbers? I'd like to know what the actual slowdown is. Non-x86 is fine.
Tested on a Cortex-A9, which is the only hardware mentioned above that I currently have access to. A normal store takes 5 ns; a release store takes 16 ns. I don't have an AVX512 machine to test on, but you can assume whatever difference from vectorization.
(I'm still a noob when it comes to the memory model, but I think there might be other relevant points on top of what yuyichao has been mentioning.) As a user, I'd appreciate it if the subtle examples whose behavior would be defined by this PR were actually left undefined, so that the compiler (or tools like TSAN?) can actually throw an error or warning when finding patterns like this. If the behavior is well-defined, it cannot be turned into an error, because that would break existing code. Furthermore, I'd prefer to read code that is explicit about memory ordering, especially when it's doing something tricky. Even after this PR, I suspect the best thread-programming practice in Julia would be to avoid the implicit default ordering (even when that's exactly what you need), as it would be hard to notice when embedded in an innocent-looking piece of code. If that's the case, the only time the behavior defined by this PR would be used would likely be by accident. However, the compiler can't tell the user that there is a "code smell", because the behavior is actually well-defined. In other words, defining this behavior is not free, even on x86. The cost is the ability to report programming errors to the users.

Here are some examples by @vtjnash, who helped me understand this PR in Slack:

```julia
function f1()
    local x
    @sync begin
        @spawn begin
            while !@isdefined(x)
            end
            println(typeof(x))
        end
        @spawn x = inferencebarrier(12345678) + 1
    end
end

function f2()
    local x = 1
    @sync begin
        @spawn while true
            println(typeof(x))
        end
        @spawn while true
            x += 1
        end
    end
end
```

This one is mine, but @vtjnash confirmed that this would also be well-defined:

```julia
mutable struct MutableStruct
    a::Any
end

function f3()
    local x
    @sync begin
        @spawn begin
            while !@isdefined(x)
            end
            println(x.a)
        end
        @spawn x = MutableStruct(1)
    end
end
```

(Just sharing mostly for myself so that it'd be easy to look up later.) (Edit: fixed typos)
So essentially nobody? I didn't just make this stuff up. Of the top languages (e.g. https://redmonk.com/sogrady/2020/02/28/language-rankings-1-20/?utm_source=rss&utm_medium=rss&utm_campaign=language-rankings-1-20), many use this model or stronger (single-threaded / seq_cst): Java does (https://docs.oracle.com/javase/specs/jls/se7/html/jls-17.html#jls-17.4), …

I'd posit that Rust, C, and C++ are actually the outliers (and I give Rust a bit of a pass due to its linear type system, since you can't write the operation if it'd be an implied data race).

It would be interesting to know about a newer machine too, since ARMv8 added the required instruction (STLR) to the ISA in 2011. While it may still be micro-coded the same way in some CPUs, it now has the ability to be implemented faster.
Do they define such ordering semantics even without qualifiers/annotations like …?
Single-threaded is exactly what the default model should be. The point is exactly about not paying the penalty by default. And I'll also say that many of the languages on that list aren't something to compare performance with. It is a no-brainer that having an overly strong memory model will make multithreaded behavior easier to understand, which will certainly make the language more popular if performance isn't even a major concern.

That was what I tested, on a Cortex-A57. I don't have access now, but I'm pretty sure it is also at least 2x slower, if not more.
Yes. While not explicitly required by the ECMA-334 C# standard, it is likely the cheapest way to implement the requirements of section 15.5.5, and is known to be the current implementation.
Which is exactly why I was talking about scatter/gather. That's where vectorization can have a bigger effect and also be more useful. It's also the area where the optimizations, and even the hardware, aren't as mature as they will be, so any current benchmark would only be a lower bound on how bad it would be in the future.

Can you give a link? Not that it's very important here, though, since it's not part of the stack we use, nor does it have very similar performance goals, AFAICT.

I don't have your version to test, but the C version I've tested shows that a relaxed atomic load correctly disables the hoist.
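A hedged reconstruction of the kind of test being described (not the actual code that was benchmarked): with a plain load, the compiler may hoist the read out of the loop entirely, so a flag set by another thread may never be observed; a relaxed atomic load must be re-issued on every iteration, which forbids the hoist.

```cpp
#include <atomic>

// Plain load: the compiler may hoist *flag out of the loop; after
// optimization this can become an infinite loop if another thread sets it.
int spin_plain(const int *flag) {
    while (*flag == 0) {}
    return *flag;
}

// Relaxed atomic load: must be re-loaded each iteration; the hoist is forbidden.
int spin_relaxed(const std::atomic<int> *flag) {
    while (flag->load(std::memory_order_relaxed) == 0) {}
    return flag->load(std::memory_order_relaxed);
}
```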
Is this the latest ECMA-334 C# standard? https://www.ecma-international.org/publications/files/ECMA-ST/ECMA-334.pdf#page=308 Quoting: …

But we don't have these strict semantics (always initialized to the default value) even in current Julia used with a single thread, right?
We have stricter semantics than this, actually, though you can partially opt out of them. But in single-threaded code, we ensure you can't see the object, not only before the default initialization (which we do have), but even before construction is finished.
Okay, but let's talk about the actual PR proposal here. C does not include a specification equivalent to the Java/C# memory model, so it's not useful as a test platform for this PR. Though with TBAA, clang could theoretically still prove that hoisting was valid, I don't expect it would be particularly productive to write such an optimization analysis into that pass.
If C doesn't have that, then neither does LLVM, and this is not even talking about the other cases that break the atomicity in the runtime C code. Without at least a relaxed load, I do not believe LLVM can actually provide such a guarantee. Sure, the generated code might happen to work on x86 or even ARM, since the compiler isn't able to find a way to break it, but that's definitely not how one should write multithreaded code. Also, TBAA has nothing to do with the optimization here. An atomic load within a loop, with TBAA or not, is a synchronization, and it is well defined to write to that location from another thread. OTOH, hoisting that load may cause the program to never terminate, which is undefined, so the compiler is not allowed to do that.
OK, I'm thinking this over. By the way, I found an interesting example in CppCon 2016: Hans Boehm, "Using weakly ordered C++ atomics correctly" (YouTube, around 31:35), which is close to the examples above. So, from the same reasoning, does this mean

```julia
mutable struct MutableStruct
    a::Any
end

function f4()
    local x
    y = MutableStruct(1)
    @sync begin
        @spawn begin
            while !@isdefined(x)
            end
            if x === y
                println(x.a)
            end
        end
        @spawn x = y
    end
end
```

can print something other than 1?
Quoth the standard: "[this ordering has] no synchronization or ordering constraints"
Quoth the langref: "[this ordering] is intended to provide a guarantee strong enough to model Java’s non-volatile shared variables". It has no equivalent expression in the C memory model.
Since using that part of Java's memory model is what I'm proposing here, I'm relatively confident that it's not just happening to work, but is working exactly as per design.
FWIW, that's only partially opting out. It's impossible to fully opt out (which would imply not even setting the type tag, and would definitely wreak havoc with single-threaded use)
Not with this PR (since that's always the value it'd get regardless of which pointer it followed), though it's dangerously close to being true if you make some small changes. Though it is also permitted (expected?) never to return at all (the while loop shouldn't synchronize, so post-optimization it shouldn't stop). Here's the small change I think you need to make this example interesting: …

So even though …
I didn't realize … However, what's the documented behavior of the "java memory model"? From the description of this ordering here, it does not mention anything that would make it a consume ordering. In other words, reading through the pointer requires a stronger ordering than relaxed (LLVM `monotonic`).
I believe that's why most of the examples usually don't use a loop....
Which means that even though the boxed value was "published to a global variable" (btw, the document says "global global"), the value it published cannot be read. You said in the top comment that …

and I believe consume ordering WILL have that guarantee (and will be stronger than relaxed, and will disable vectorization and a bunch of other single-thread optimizations). It is the consume ordering that I don't think we should have by default, so if that's not the case (in particular, if the new ordering allows almost all of the single-thread optimizations), then it'll certainly be more likely to be OK. If that's the case, I think the document and that comment should be updated, and there should be a link to or a copy of the precise model you want. (FWIW, even if it is consume ordering, such a link or copy is still necessary; it just isn't as necessary during the review for me, since I know where to find the C++ version.)
Correct—it's not consume ordering on all processors, just all of the ones we'd ever care about (the exception is the Alpha).
Consume ordering is not well-defined presently, due partly to the counterexample you gave. I don't know what their thinking is yet for correcting it. I'd perhaps suggest that we treat these atomics like their namesakes, electrons, and claim that interacting them results in an entanglement and a superposition of their properties, haha.
It's documented in JSR-133, here: https://docs.oracle.com/javase/specs/jls/se7/html/jls-17.html. I think the mc (memory chain) order in Java 17.5.1 (Semantics of final Fields) is a specification of a consume load, if the data is not later changed (which introduces a second data race, and causes the problems discussed above). I'll update my documentation to clarify that this guarantee only applies to immutable objects and their type tags. That also will at least make it so that the …
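As a hedged C++ analogue of that final-field guarantee (names are hypothetical; a `const` member stands in for a Java final field, or for Julia's immutable fields and type tag):

```cpp
#include <atomic>

// The const member plays the role of a Java "final" field.
struct Box {
    const int a;
    explicit Box(int v) : a(v) {}
};

std::atomic<Box*> shared{nullptr};

void construct_and_publish() {
    // All constructor writes happen before the publication below.
    shared.store(new Box(1), std::memory_order_release);
}

int reader() {
    Box *b = shared.load(std::memory_order_consume); // dependency-ordered load
    return b ? b->a : -1;                            // b->a is guaranteed to read 1
}
```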
Well, if it is not consume ordering, then it is not anywhere. Relying on hardware behavior while assuming some compiler limitation is not an ordering.

> Counterexample above

I didn't think I gave a counterexample. If you mean the ones you and @tkf gave above, e.g. #35535 (comment) and #35535 (comment), then I believe they are very well defined for consume ordering. In fact, I believe both the definition of the ordering and the hardware behavior for naive codegen guarantee that everything in the branch can observe everything released by the synchronizing store. The branch itself, not just reading through the pointer, counts as a dependency on the consume load. The only orderings where the result is not guaranteed to see the change are relaxed or weaker.

… and it even explicitly mentions that (as you've quoted).

AFAICT, a Java frontend might as well lower a load from a final field as an acquire load (with everything else being unordered), and that'll satisfy this requirement just fine. This could for sure also be implemented using much more expensive synchronization during finalization of the object. In other words, just from the document presented, it's clear that for a final field in Java the read through the pointer is well defined. I can see how that could be implemented efficiently with a custom compiler, but I don't see a clear and well-defined mapping to LLVM. AFAICT, without a consume or acquire load, LLVM will not guarantee the dependency being honored. There should be some documentation of how this mapping can be done, what assumptions (outside of the standard C memory model, at least) it makes, and how those assumptions are guaranteed.
Consume ordering was removed from the standard in C++17 due to problems with the original attempt at defining it and the pitfalls it entailed, discussion at https://lwn.net/Articles/586838/ and presumably in the YouTube video.
Only if you ignore the rest of the document
From the discussion here, it seems like the issue is you not reading the LLVM LangRef, and not a lack of information therein. I know I initially related this to the C11 consume ordering, though I've since intended to clarify that to claim instead the Java memory model for this PR. That is also literally exactly what the LangRef says this does ("This is intended to provide a guarantee strong enough to model Java's [requirements]"). Why are you claiming that this won't do what Java intends, and that Java needs to use something other than this? ("[The LangRef needs to] document on how this mapping [from Java to LLVM] can be done")
It seems to still be in the standard. It is being revised, and that's why I mentioned the hardware behavior of the branch being a dependency, which should be the behavior the standard tries to match.

Hmmm, what I said is that …, which is definitely stronger than everything else mentioned here. If this cannot satisfy the requirement, I don't see what can. Which other part of the document am I ignoring? (I mean, sure, there are other types that I'm not talking about, and I'm not implying that a volatile/atomic load can be implemented as unordered. I just don't see why lowering this one as acquire would violate anything.)

It's because I don't want to read more from the document than what it says explicitly. I fully agree that the unordered ordering could/should be used to match the part of the Java memory model that doesn't exist in C. I don't believe the LLVM document implies that this ordering is all you need to implement all the different types of loads in Java (it's missing atomics at the least). And that's why I'm claiming that if it is intended to provide more than what the document says, those guarantees should also be explicitly stated. For the record, other than …, which AFAICT does not talk about …, and …

Another reason this cannot possibly be the whole story is that a load ordering has to match a store ordering. That's why the C++ document mentions all the orderings between relaxed and seq_cst in pairs.
The LangRef says that this explicitly provides the Java memory model. I realize that is nearly the only thing it says. But claiming it might not be useful for Java is utter nonsense when it's explicitly defined for that exact purpose. The LangRef also explains it in slightly more detail in the linked document: https://llvm.org/docs/Atomics.html

Are you ever going to read the document you keep quoting? They are in fact also explicitly covered a few paragraphs later in the LangRef: "This corresponds to ... Java volatile" (in Java's vocabulary, …

Yeah, there's recent literature that talks about how this is a deficiency in the standard, since it doesn't accurately reflect the behavior of hardware. I read a lot of it last week and can pass along links, but basically it boils down to saying that the standard is not the whole story, since reasonable possible physical implementations end up being different from it.
When did I say this? I said I believe it can be used for Java, but it won't be the only thing Java needs.

Yes I did, and yes, I later realized that atomics aren't a good example, since the document does specify "shared variable", which can probably at least rule out volatiles. It was just one of the few special memory-ordering tools from Java that I know of. OTOH, I am correct to say that this (…

That's not at all what I meant. What I meant is …
Update: from various external discussions, I think the current proposal for this PR is to change most of the release stores to unordered (or relaxed) stores.
In Java, "shared" means anything non-local. It's orthogonal to volatile.
In Java, every object effectively has at least one final field (the typeof tag), thus it's required for all objects.
True, it's then something implementation-defined. For all except Alpha, this gives us the proposed release/consume for all operations that occurred in the constructor.

The triage call's conclusion is still to update this PR to change most of the release stores to unordered (or relaxed) and to keep the changes to the loads as they are herein.
The read may still be torn, as that still requires fixes elsewhere, so this documentation is not yet actually correct.
This lets us say that anytime you observe the store of a pointer, it's valid to assume consume ordering on that read (on all processors except the Alpha). It's also already the default behavior on x86 hardware; this just enforces the ordering on the compiler too.
And add load/store alignment annotations, because LLVM now prefers that we try to specify those explicitly, even though it's not required. This does not yet include correct load/store behaviors for objects with inlined references (the recent #34126 PR).
This lets us say that anytime you observe the store of an object reference (for example, by reading a global, or a pointer field of an object, or a pointer array), it is valid to assume consume ordering on that read, meaning that you can safely read through that pointer (on all processors except the Alpha, where these would need to be strengthened to acquire loads). It's also already the default behavior on x86 hardware; this just enforces the ordering on the compiler too, and should add the required store barriers on ARM and PPC to make it viable there as well.