Re-implement AArch64 atomic load and stores #3128
Conversation
@uweigand I made some changes in the s390x backend and I'm really not sure if it's right, could you take a look please?
@@ -4579,8 +4579,7 @@ pub(crate) fn define(
         r#"
         Atomically load from memory at `p`.

-        This is a polymorphic instruction that can load any value type which has a memory
-        representation. It should only be used for integer types with 8, 16, 32 or 64 bits.
+        It should only be used for integer types with 32 or 64 bits.
Please don't change this. In rustc_codegen_cranelift I use `atomic_load` to load true i8 and i16 values without any zero-extension afterwards.
+1, it's better to be consistent with normal load instructions here and accept any type, I think.
Okay.
I'm attempting to revert the IR changes so we have the atomic_load + uextend. During lowering I'm trying to pattern match from the uextend, which I think works fine. But then I'm assuming the AtomicLoad node is also being lowered, since my output code ends up with two loads. Is there a way to explicitly replace a value or remove an instruction?
Ah, interesting! So the lowering framework keeps track of reference counts for values; instructions are lowered in a backward pass where inputs are only produced if something depends on them, or if they have side-effects. However in this case the load is considered to have a side effect (it can trap). This actually needs slightly different handling; it's not legal to duplicate the load (imagine if e.g. a store interleaved).
In cases where we merge an effectful op (like a load), we need to call `sink_inst()` on the context as well; there's a comment explaining this, and the x64 backend has an example of how it's used.
Sorry this is a bit subtle!
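To make the mechanics concrete, here is a minimal sketch of that shape, assuming a `LowerCtx`-style lowering context with `sink_inst()` and a `maybe_input_insn` helper; the instruction variants, field names, and other helper signatures are illustrative stand-ins, not the actual aarch64 code:

```rust
// Sketch only: lowering `uextend` when its input comes straight from an
// `atomic_load`. Variant and helper names are placeholders.
Opcode::Uextend => {
    let rd = get_output_reg(ctx, outputs[0]);
    if let Some(load) = maybe_input_insn(ctx, inputs[0], Opcode::AtomicLoad) {
        // Merge: emit one acquiring load that already yields the extended value.
        let addr = put_input_in_reg(
            ctx,
            InsnInput { insn: load, input: 0 },
            NarrowValueMode::None,
        );
        ctx.emit(Inst::LoadAcquire {
            access_ty: ctx.output_ty(load, 0),
            rt: rd,
            rn: addr,
        });
        // Essential: mark the effectful load as sunk into this use, so the
        // backward lowering pass does not also emit it on its own (which
        // would duplicate the load).
        ctx.sink_inst(load);
    } else {
        // No mergeable load: extend the already-computed value instead.
        let rn = put_input_in_reg(ctx, inputs[0], NarrowValueMode::None);
        let from_bits = ty_bits(ctx.input_ty(insn, 0)) as u8;
        let to_bits = ty_bits(ctx.output_ty(insn, 0)) as u8;
        ctx.emit(Inst::Extend { rd, rn, signed: false, from_bits, to_bits });
    }
}
```

The `sink_inst()` call is the part that keeps this legal: it transfers the load's side effect to the consuming instruction rather than duplicating it, so no store can interleave between two copies of the load.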
`is_mergeable_load` returns `None` for `ty.bits() < 32`.
That's specific to the x64 backend, because of the way that narrow loads become wider in some cases; I imagine that doesn't apply here since the widening loads don't reach beyond their width.
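For reference, the x64-side guard mentioned above looks roughly like the following paraphrase (it assumes the helper keeps its `Option<(InsnInput, i32)>` shape; this is not the exact source):

```rust
// Paraphrased sketch of the x64 backend's merge guard: loads narrower than
// 32 bits are never offered for merging, because (as noted above) narrow
// loads become wider in some merged forms on x64.
fn is_mergeable_load<C: LowerCtx<I = Inst>>(
    ctx: &mut C,
    src_insn: IRInst,
) -> Option<(InsnInput, i32)> {
    let load_ty = ctx.output_ty(src_insn, 0);
    if ty_bits(load_ty) < 32 {
        return None;
    }
    // Remaining checks elided: the producer must be a plain `load` with a
    // single address input and an offset that fits the addressing mode.
    Some((InsnInput { insn: src_insn, input: 0 }, 0))
}
```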
ig.push(
    Inst::new(
        "atomic_store",
        r#"
        Atomically store `x` to memory at `p`.

-        This is a polymorphic instruction that can store any value type with a memory
-        representation. It should only be used for integer types with 8, 16, 32 or 64 bits.
+        This is a polymorphic instruction that can store a 32 or 64-bit value.
Same here.
cbnz x24, again
dmb ish
Would it make sense to keep ARMv7 support and only emit the new sequence for ARMv8?
We certainly will need a similar sequence when we complete the ARM32 (ARMv7) backend, but that doesn't share code with the aarch64 (aka ARMv8) one we're in here, so I think it's fine/expected that we avoid ARMv7-isms here.
ctx.emit(match (ty_bits(access_ty), ty_bits(ty)) {
    (8, 32) => Inst::Load32ZExt8 { rd, mem },
    (8, 64) => Inst::Load64ZExt8 { rd, mem },
    (16, 32) => Inst::LoadRev16 { rd, mem },
This is wrong, because the lrvh instruction does not perform any extension. To support this, you'd have to do the same that is currently done for a Uload16 and emit an explicit Extend after the load-reverse.
Also, I think you need to support the same set of operations for the little-endian case as for the big-endian case, which means the 16->64 and 32->64 extension cases also should be added. They need to be implemented via the same explicit Extend after the load.
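A rough sketch of what that could look like, restructured so an arm can emit more than one instruction (the `Inst::Extend` and `Inst::LoadRev32` shapes here are assumptions patterned on the description above, not the actual s390x definitions):

```rust
// Sketch only: little-endian atomic loads that must also zero-extend.
// `lrvh`/`lrv` (LoadRev16/LoadRev32) do not extend, so an explicit extend
// follows the byte-reversed load, mirroring the existing Uload16 handling.
match (ty_bits(access_ty), ty_bits(ty)) {
    // A single byte needs no byte reversal, so the extending loads are fine.
    (8, 32) => ctx.emit(Inst::Load32ZExt8 { rd, mem }),
    (8, 64) => ctx.emit(Inst::Load64ZExt8 { rd, mem }),
    (16, 32) | (16, 64) => {
        ctx.emit(Inst::LoadRev16 { rd, mem });
        ctx.emit(Inst::Extend {
            rd,
            rn: rd.to_reg(),
            signed: false,
            from_bits: 16,
            to_bits: ty_bits(ty) as u8,
        });
    }
    (32, 64) => {
        ctx.emit(Inst::LoadRev32 { rd, mem });
        ctx.emit(Inst::Extend {
            rd,
            rn: rd.to_reg(),
            signed: false,
            from_bits: 32,
            to_bits: 64,
        });
    }
    _ => unimplemented!("remaining access/result width combinations"),
}
```

The `(32, 64)` arm is included because, as noted above, the little-endian path needs the same set of extension cases as the big-endian one.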
Okay, thanks.
 block0(v0: i64):
-  v1 = atomic_load.i16 little v0
+  v1 = atomic_uload16.i32 little v0
   return v1
 }

 ; check: lrvh %r2, 0(%r2)
The test case is likewise wrong. You can look at load-little.clif for a complete set of load test cases. Those should be usable for atomic loads as well (aligned loads are atomic by default on the architecture).
The AArch64 support was a bit broken and was using Armv7 style barriers, which aren't required with Armv8 acquire-release load/stores. The fallback CAS loops and RMW, for AArch64, have also been updated to use acquire-release, exclusive, instructions which, again, remove the need for barriers. The CAS loop has also been further optimised by using the extending form of the cmp instruction. Copyright (c) 2021, Arm Limited.
Along with the x64 and s390x changes. Now pattern matching the uextend(atomic_load) in the aarch64 backend.
Force-pushed from a9a93a4 to b6f6ac1.
LGTM with the updates; thanks for the patience (and sorry for the delay in looking again)!