
Re-implement AArch64 atomic load and stores #3128

Merged: 2 commits, Aug 6, 2021

Conversation

@sparker-arm (Contributor) commented on Jul 29, 2021:

The AArch64 support was a bit broken and was using Armv7-style barriers, which aren't required with Armv8 acquire-release loads/stores. The backend now pattern-matches uextend(atomic_load) too.

The fallback CAS loops and RMW sequences for AArch64 have also been updated to use acquire-release exclusive instructions, which again removes the need for barriers. The CAS loop has also been further optimised by using the extending form of the cmp instruction.

Copyright (c) 2021, Arm Limited.
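
For background, here is a minimal, standalone Rust sketch (not part of this patch, added for illustration) of the acquire-release ordering the new code generation relies on. On AArch64, compilers typically lower the Release store and Acquire load below to stlr and ldar, which carry their ordering themselves, so no separate dmb barrier is needed:

    use std::sync::atomic::{AtomicU64, Ordering};

    static FLAG: AtomicU64 = AtomicU64::new(0);

    fn publish(value: u64) {
        // Release store: on Armv8 this typically becomes `stlr`,
        // with no trailing `dmb` barrier required.
        FLAG.store(value, Ordering::Release);
    }

    fn consume() -> u64 {
        // Acquire load: on Armv8 this typically becomes `ldar`,
        // with no leading `dmb` barrier required.
        FLAG.load(Ordering::Acquire)
    }

    fn main() {
        publish(42);
        println!("{}", consume());
    }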

@sparker-arm (Author):
@uweigand I made some changes in the s390x backend, and I'm really not sure they're right; could you take a look, please?

@github-actions bot added labels: cranelift (Issues related to the Cranelift code generator), cranelift:area:aarch64 (Issues related to AArch64 backend), cranelift:area:x64 (Issues related to x64 codegen), cranelift:meta (Everything related to the meta-language), cranelift:wasm (Jul 29, 2021)
@@ -4579,8 +4579,7 @@ pub(crate) fn define(
r#"
Atomically load from memory at `p`.

-This is a polymorphic instruction that can load any value type which has a memory
-representation. It should only be used for integer types with 8, 16, 32 or 64 bits.
+It should only be used for integer types with 32 or 64 bits.
@bjorn3 (Contributor) commented Jul 29, 2021:
Please don't change this. In rustc_codegen_cranelift I use atomic_load to load true i8 and i16 values without any zero-extension afterwards.

Member:
+1, it's better to be consistent with normal load instructions here and accept any type, I think.

Contributor (Author):
Okay.

Contributor (Author):
I'm attempting to revert the IR changes so we keep the atomic_load + uextend. During lowering, I'm trying to pattern-match from the uextend, which I think works fine, but the AtomicLoad node also appears to be lowered, so my output code ends up with two loads. Is there a way to explicitly replace a value or remove an instruction?

Member:
Ah, interesting! The lowering framework keeps track of reference counts for values; instructions are lowered in a backward pass where inputs are only produced if something depends on them, or if they have side effects. In this case, however, the load is considered to have a side effect (it can trap), which needs slightly different handling: it's not legal to duplicate the load (imagine if, say, a store were interleaved between the two copies).

In cases where we merge an effectful op (like a load), we need to call sink_inst() on the context as well; see this comment. An example of how this is used is here in the x64 backend.

Sorry, this is a bit subtle!
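
For illustration, a hedged sketch (not the actual patch) of how a backend might merge uextend(atomic_load) and sink the load. The helper names maybe_input_insn, get_output_reg and LowerCtx::sink_inst follow the Cranelift lowering API of this era, but the exact signatures and the emitted instruction are simplified assumptions here:

    Opcode::Uextend => {
        if let Some(load) = maybe_input_insn(ctx, inputs[0], Opcode::AtomicLoad) {
            // The atomic load can trap, so it must not be duplicated; sinking it
            // tells the framework that the merged copy is the only one and that
            // the original instruction will not be lowered separately.
            ctx.sink_inst(load);
            let rd = get_output_reg(ctx, outputs[0]).only_reg().unwrap();
            // ... emit a single zero-extending acquire load (ldarb/ldarh) into `rd` ...
        } else {
            // ... ordinary uextend lowering ...
        }
    }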

Contributor:
is_mergeable_load returns None for ty.bits() < 32.

Member:
That's specific to the x64 backend, because of the way that narrow loads become wider in some cases; I imagine that doesn't apply here since the widening loads don't reach beyond their width.

ig.push(
Inst::new(
"atomic_store",
r#"
Atomically store `x` to memory at `p`.

-This is a polymorphic instruction that can store any value type with a memory
-representation. It should only be used for integer types with 8, 16, 32 or 64 bits.
+This is a polymorphic instruction that can store a 32 or 64-bit value.
Contributor:
Same here.

cbnz x24, again
-dmb ish
Contributor:
Would it make sense to keep ARMv7 support and only emit the new sequence for ARMv8?

Member:
We certainly will need a similar sequence when we complete the ARM32 (ARMv7) backend, but that doesn't share code with the aarch64 (aka ARMv8) one we're in here, so I think it's fine/expected that we avoid ARMv7-isms here.

ctx.emit(match (ty_bits(access_ty), ty_bits(ty)) {
(8, 32) => Inst::Load32ZExt8 { rd, mem },
(8, 64) => Inst::Load64ZExt8 { rd, mem },
(16, 32) => Inst::LoadRev16 { rd, mem },
Member:
This is wrong, because the lrvh instruction does not perform any extension. To support this, you'd have to do the same thing that is currently done for Uload16 and emit an explicit Extend after the load-reverse.

Also, I think you need to support the same set of operations for the little-endian case as for the big-endian case, which means the 16->64 and 32->64 extension cases also should be added. They need to be implemented via the same explicit Extend after the load.
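
A hedged sketch of the suggested shape (not the actual patch): for these little-endian extending cases the match arm needs to emit two instructions, the byte-reversing load followed by an explicit zero-extension, because lrvh/lrv do not extend. Inst::LoadRev16/Inst::LoadRev32 appear in the diff above; the explicit extension step is left as a comment since its exact fields are not shown here:

    (16, 32) | (16, 64) => {
        ctx.emit(Inst::LoadRev16 { rd, mem });
        // ... follow with an explicit 16-bit -> 32/64-bit zero-extend of `rd`,
        // as the existing Uload16 lowering does ...
    }
    (32, 64) => {
        ctx.emit(Inst::LoadRev32 { rd, mem });
        // ... follow with an explicit 32-bit -> 64-bit zero-extend of `rd` ...
    }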

Contributor (Author):
Okay, thanks.

block0(v0: i64):
-v1 = atomic_load.i16 little v0
+v1 = atomic_uload16.i32 little v0
return v1
}

; check: lrvh %r2, 0(%r2)
Member:
The test case is likewise wrong. You can look at load-little.clif for a complete set of load test cases. Those should be usable for atomic loads as well (aligned loads are atomic by default on the architecture).

Second commit: Along with the x64 and s390x changes. Now pattern matching the uextend(atomic_load) in the aarch64 backend.
@sparker-arm changed the title from "Re-implement atomic load and stores" to "Re-implement AArch64 atomic load and stores" on Aug 5, 2021
@cfallin (Member) left a comment:
LGTM with the updates; thanks for the patience (and sorry for the delay in looking again)!

@cfallin cfallin merged commit 9c550fc into bytecodealliance:main Aug 6, 2021