[DO NOT MERGE] [Pre-RFC] ACLE in Rust #184
Since CMSIS is also a vendor-published standard containing intrinsics, can we bring in BKPT from CMSIS? It also contains I can completely see the reasoning for implementing ACLE in |
Does that mean that |
Yes, see ACLE 12.1.5:
|
There are 3 worlds:
armvcc v4/v5 support armcc v6 use a compatibility include file ( So to enable We can have the same approach on Rust. That is more similar to a ACLE + some legacy intrinsic, Also pay attention that:
|
Here some intrinsics that are supported in ARM Compiler 6 using |
No objection to this way of describing it, though I think the end result is the same, but I note we require vendor-published specifications for intrinsics in
Would you implement both then? In CMSIS for Cortex-M specifically there's no |
We'd be putting the intrinsics from ACLE into
Right now |
I'm OK with |
I double-checked the memory barrier stuff against the ARM documentation and found some things that didn't quite add up. See comments for details. I only glanced over the rest of the document (ran out of time).
Please note that I know very little about ACLE, so I might be missing some context here and could be totally wrong.
[ACLE]: #acle

[ARM C Language Extensions Q2 2018](https://silver.arm.com/download/ARM_and_AMBA_Architecture/AR580-DA-70000-r0p0-06rel0/DDI0403E_c_armv7m_arm.pdf)
This links to the ARMv7-M Architecture Reference Manual, not ACLE.
I believe this is the correct link:
https://developer.arm.com/products/software-development-tools/compilers/arm-compiler-5/docs/101028/latest/1-preface
```
// SY, LD, ST, ISH, ISHLD, ISHST, NSH, NSHST, OSH, OSHLD, OSHST
//
// Only SY implements the `Isb` trait
```
(this comment applies to the complete code example above)
You marked some of those parameters as being only present on `aarch64`, but according to this code, all others are available everywhere. This doesn't look quite right to me:
- I could find nothing in the ACLE docs on why the parameters you marked as `#[cfg(target_arch = "aarch64")]` should be marked such.
- For all three instructions, only the `SY` parameter is supported on Cortex-M. See ARMv6-M Architecture Reference Manual pages 121, 122, 124; ARMv7-M Architecture Reference Manual pages 236, 237, 241; ARMv8-M Architecture Reference Manual pages 422, 423, 431.
> I could find nothing in the ACLE docs on why the parameters you marked as `#[cfg(target_arch = "aarch64")]` should be marked such.
Indeed. This piece of information is in the armasm Reference Guide. My main motivation for excluding those, though, is that LLVM will reject `asm!("DMB LD")` when compiling for ARMv7 and older, so this has to be done; otherwise we can't implement the ACLE API.
> For all three instructions, only the SY parameter is supported on Cortex-M.
Yes, the other parameters are reserved. We could conditionally exclude them from the API but AIUI ACLE doesn't exclude them. Also, LLVM is happy to accept e.g. `DMB ST`; it will simply be executed as `DMB SY` by the target system -- though the document you linked says that software shouldn't rely on this behavior.
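For illustration, here is a minimal sketch of the gating discussed in this thread: marker types as barrier arguments, with `LD` only defined on `aarch64`. It is not the RFC's actual code. The `sealed` module, the helper names `dmb_sy` / `dmb_ld`, and the host fallback are assumptions made so the snippet also builds on non-Arm machines; it uses today's `core::arch::asm!` syntax rather than the syntax in the RFC draft.

```rust
mod sealed {
    // Prevents downstream crates from adding their own barrier arguments.
    pub trait Sealed {}
}

/// Full-system barrier argument.
pub struct SY;
/// Load-only barrier argument, gated to aarch64 as in the discussion above.
#[cfg(target_arch = "aarch64")]
pub struct LD;

impl sealed::Sealed for SY {}
#[cfg(target_arch = "aarch64")]
impl sealed::Sealed for LD {}

/// Arguments accepted by `__dmb` (sketch).
pub trait Dmb: sealed::Sealed {
    unsafe fn dmb(&self);
}

// No `nomem` option: the compiler must assume the asm touches memory,
// which is what makes it a code motion barrier as well.
#[cfg(target_arch = "aarch64")]
#[inline(always)]
unsafe fn dmb_sy() {
    unsafe { core::arch::asm!("dmb sy", options(nostack, preserves_flags)) }
}

// Host stand-in (assumption): a compiler-only fence so the sketch builds anywhere.
#[cfg(not(target_arch = "aarch64"))]
#[inline(always)]
unsafe fn dmb_sy() {
    core::sync::atomic::compiler_fence(core::sync::atomic::Ordering::SeqCst)
}

#[cfg(target_arch = "aarch64")]
#[inline(always)]
unsafe fn dmb_ld() {
    unsafe { core::arch::asm!("dmb ld", options(nostack, preserves_flags)) }
}

impl Dmb for SY {
    #[inline(always)]
    unsafe fn dmb(&self) {
        unsafe { dmb_sy() }
    }
}

#[cfg(target_arch = "aarch64")]
impl Dmb for LD {
    #[inline(always)]
    unsafe fn dmb(&self) {
        unsafe { dmb_ld() }
    }
}

/// ACLE-style entry point: `__dmb(SY)` compiles everywhere,
/// `__dmb(LD)` is a compile error on targets where `LD` is not defined.
#[inline(always)]
pub unsafe fn __dmb<A: Dmb>(arg: A) {
    unsafe { arg.dmb() }
}
```

With this shape, an unsupported argument fails at compile time instead of reaching the assembler, which is the property the `#[cfg]` gating is after.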
#[cfg(not(/* like above */))]
unsafe fn dmb(&self) {
    // No-op but still a compiler barrier because of "memory"
    asm!("" : : : "memory" : "volatile");
The above quote from ACLE states:
> They may be no-ops (i.e. generate no code, but possibly act as a code motion barrier in compilers) on targets where the relevant instructions do not exist, but only if the property they guarantee would have held anyway.
Do we know that the property the instructions guarantee will hold with this implementation? I'm not saying they don't, just asking whether anyone has thought through this. I'm not familiar with the architectures in question.
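To make "no code, but still a code motion barrier" concrete, here is a small sketch using `core::sync::atomic::compiler_fence`, which emits no instruction and only restricts compiler reordering. Whether that is sufficient is exactly the question above: it is only sound if the hardware would have preserved the ordering anyway (for example, an interrupt handler observing the stores on the same single core). The statics `DATA` and `READY` and the function `publish` are hypothetical names.

```rust
use core::sync::atomic::{compiler_fence, AtomicBool, AtomicU32, Ordering};

// Hypothetical shared state, e.g. between main code and an interrupt
// handler running on the same core.
static DATA: AtomicU32 = AtomicU32::new(0);
static READY: AtomicBool = AtomicBool::new(false);

/// Stores the payload, then raises the flag. The compiler fence keeps the
/// compiler from sinking the payload store below the flag store; it does
/// not constrain the hardware in any way.
pub fn publish(value: u32) {
    DATA.store(value, Ordering::Relaxed);
    compiler_fence(Ordering::Release);
    READY.store(true, Ordering::Relaxed);
}
```

On a target that does have a DMB instruction and more than one observer, this fence alone would not provide the guarantee; a real barrier would be needed there.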
I am rather new to ACLE and CMSIS so please ignore if I am just about to show my ignorance ;)
I really wonder if the target gating here is a boon or a bane.
If I am a programmer knowledgeable enough that I want to insert something as low level as a DMB, I will probably work with the architecture reference manual of my target CPU anyway, and can just double-check what my options are for DMB and then write something that explicitly emits what I want.
With a target-gated `__dmb()` in `std`, I additionally have to cross-check whether Rust's implementation of it does for my architecture what I expect it to do (in contrast to doing the target gating in my own code).
I know that portability is why ACLE and CMSIS exist. Are they used often in the wild?
Because I wonder if we can be sure that there aren't cases where ACLE assumptions result in unwanted behavior for this or that target. Just thinking out loud... 😕
> With a target-gated `__dmb()` in `std`, I additionally have to cross-check if Rust's implementation of it does for my architecture what I expect it to do (in contrast to doing the target gating in my own code).
The answer is in the ACLE spec:
> The intrinsics in this section are available for all targets. They may be no-ops (i.e. generate no code, but possibly act as a code motion barrier in compilers) on targets where the relevant instructions do not exist, but only if the property they guarantee would have held anyway.
> portable across compilers, and across Arm architecture variants, while | ||
> exploiting the advanced features of the Arm architecture. | ||

## Memory barriers
/// specified scope) before memory accesses issued after the DMB.
///
/// For example, DMB should be used between storing data, and updating a flag
/// variable that makes that data available to another core.
Why aren't atomics enough for this operation? Or is this just implemented as an atomic operation ? (if so, it should state which one exactly).
I know that `atomic::fence` produces a DMB instruction (and that all non-relaxed atomic ops involve atomic fences) so I checked what that function produces for different arguments and different compilation targets:
- For `aarch64-unknown-linux-gnu`, `atomic::fence(Ordering::Acquire)` produces `DMB ISHLD` and the other 3 (excluding `Ordering::Relaxed`) produce `DMB ISH`.
- For `thumbv7m-none-eabi` and similar, `atomic::fence(*)` (any ordering but relaxed) produces `DMB SY`.
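For reference, the "store data, then update a flag" pattern from the RFC's doc comment can be written with `atomic::fence` alone when both sides are ordinary CPU threads. The sketch below is a host-runnable illustration of that case only; the function name `publish_and_read` is made up for the example, and it says nothing about devices or other bus masters, which is where the discussion above suggests explicit DMB variants come in.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Publishes a value from one thread and reads it from another, using
/// relaxed accesses ordered by explicit release/acquire fences.
pub fn publish_and_read() -> u32 {
    let data = Arc::new(AtomicU32::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let (d, r) = (Arc::clone(&data), Arc::clone(&ready));
    let producer = thread::spawn(move || {
        d.store(42, Ordering::Relaxed);
        // Orders the data store before the flag store.
        fence(Ordering::Release);
        r.store(true, Ordering::Relaxed);
    });

    // Spin until the flag is observed.
    while !ready.load(Ordering::Relaxed) {
        std::hint::spin_loop();
    }
    // Pairs with the release fence: the data store is now visible.
    fence(Ordering::Acquire);
    let value = data.load(Ordering::Relaxed);

    producer.join().unwrap();
    value
}
```

Compiling functions like these with `--emit asm` for the targets listed above is one way to inspect which DMB variant each fence currently lowers to.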
> Why aren't atomics enough for this operation?
The API in `std::sync::atomic` doesn't seem to cover all the possible ways one can use the DMB instruction. The `aarch64-linux` target in particular seems to be geared towards homogeneous multicore systems, which is why it uses the ISH variants; my understanding is that heterogeneous multicore systems (Cortex-A + GPU, or Cortex-A + Cortex-M) and embedded devices (a core and peripherals connected to the same data bus) need the stronger SY (full system) variants for proper synchronization.
> Or is this just implemented as an atomic operation?
These will be implemented using inline assembly, not `atomic::fence` or any other LLVM (fence) intrinsic.
> if so, it should state which one exactly
At the beginning of this comment I wrote what `atomic::fence` maps to today, but I don't know if it's guaranteed those mappings will hold in the future, or if changing the codegen options changes the mappings. My understanding is that replacing `DMB ISH` with `DMB SY` is valid because the latter has stronger guarantees, so that would be a valid change for LLVM to make in the future.
So I wouldn't write that "`DMB ISH` maps to `atomic::fence(Ordering::SeqCst)`" in the docs because I'm not sure it will always hold.
> The aarch64-linux target in particular seems to be geared towards homogeneous multicore systems therefore it uses the ISH variants; my understanding is that heterogeneous multicore systems (Cortex-A + GPU, or Cortex-A + Cortex-M) and embedded devices (a core and peripherals connected to the same data bus) need the stronger SY (full system) variants for proper synchronization.
I believe big.LITTLE systems are reasonably common on high-end mobile SoCs that run Linux, so I would be surprised if the usual atomics weren't enough for those sorts of systems. However, it sounds quite plausible that you'd need a stronger barrier for things like DMA transfers or memory shared with a GPU.
> At the beginning of this comment I wrote what atomic::fence maps to today, but I don't know if it's guaranteed those mappings will hold in the future, or if changing the codegen options changes the mappings.
There are multiple "minimal" compilation strategies for atomics that are sound (see https://www.cl.cam.ac.uk/~pes20/cpp/cpp0xmappings.html), so the mapping could theoretically change even without "useless" strengthenings like DMB ISH -> DMB SY. One practical constraint, though, is that all code working on the same model needs to be compiled with the same mapping (again see the link).
I'd think the more important knowledge for a programmer is the opposite direction: which uses of these intrinsics subsume which atomics? For example, in which cases can `atomic::fence(...); __dmb(...);` be collapsed to just `__dmb(..);`? I don't have a good answer for that either, but at least in principle it can be answered by correlating the formal definitions of the relevant ISA memory model (note: there are multiple substantially different ones even for ARMv8) and language memory model.
> The API in std::sync::atomic doesn't seem to cover all the possible ways one can use the DMB instruction.
YES! And it probably can't. For example, DMB OSH can be used only if "DMA is Bufferable" (CONFIG_ARM_DMA_MEM_BUFFERABLE on Linux) OR in SMP mode; DMB SY must be used otherwise. But only an OS or a bare-metal application has that kind of information.
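As a sketch of why this choice has to live in the OS or application rather than in `std::sync::atomic`, the rule stated above can be written as a small decision function. The type and field names (`PlatformConfig`, `DmbDomain`, `dma_barrier_domain`) are hypothetical; only the rule itself (OSH if DMA memory is bufferable or the system is SMP, SY otherwise) is taken from the comment.

```rust
/// Which DMB variant a DMA-related barrier should use (illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DmbDomain {
    /// `DMB OSH`: outer shareable domain.
    Osh,
    /// `DMB SY`: full system.
    Sy,
}

/// Platform knowledge that only an OS or bare-metal application has.
/// Field names are assumptions made for this example.
pub struct PlatformConfig {
    /// Corresponds to CONFIG_ARM_DMA_MEM_BUFFERABLE on Linux.
    pub dma_mem_bufferable: bool,
    /// Whether the system runs in SMP mode.
    pub smp: bool,
}

/// Applies the rule from the comment above.
pub fn dma_barrier_domain(cfg: &PlatformConfig) -> DmbDomain {
    if cfg.dma_mem_bufferable || cfg.smp {
        DmbDomain::Osh
    } else {
        DmbDomain::Sy
    }
}
```

A generic atomics API has no access to these inputs, which is one argument for exposing the barrier argument directly to the caller.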
So like in the implementation of memory barriers these intrinsics will do
nothing on *some* sub-architectures.

## System register access
What are system registers? Particularly, which operations modify them?
Is LLVM allowed to assume when calling an extern function that the contents of system registers won't change?
System registers can be one of (this list may not be exhaustive): status register, program counter, link register, the stack pointer and special registers present in the Cortex-M and Cortex-R variants.
The status register can change whenever the processor executes an instruction, as it contains flags that indicate whether the last operation resulted in a zero, whether it overflowed, etc. The program counter changes every time the processor executes an instruction. The link register changes every time there's a function call via the BL (branch with link) or BLX (branch with link and exchange) instruction. The compiler will generate instructions that change the stack pointer to allocate space for stack variables.
I know of no practical use case for doing RMW (read-modify-write) operations on those registers, precisely because compiler optimizations make the contents of these registers unpredictable, if that's what you are concerned about.
There are use cases for reading things like the stack pointer and the link register and then restoring them at a later time to implement context switching, but those routines have to be written in assembly to work properly (you can't piece together intrinsics and get the behavior you want).
There is a use case for reading registers like the stack pointer for debugging / diagnostics purposes.
The Cortex-M / Cortex-R special registers include registers like BASEPRI, PRIMASK and FAULTMASK. These are used to create critical sections, and you will be doing reads and writes to these. The writes need to behave as compiler barriers ("memory" clobber in `asm!`) to get the correct behavior.
> Is LLVM allowed to assume when calling an extern function that the contents of system registers won't change?
If by "extern" you mean FFI then the answer is no. As I mentioned at the beginning, even single instructions will change some of these registers.
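To illustrate why the register writes need to act as compiler barriers, here is a host-side sketch of a PRIMASK-style critical section. The register is simulated with a static (`FAKE_PRIMASK` is a stand-in invented for this example, as are `read_primask`, `write_primask` and `critical_section`); on real hardware the read and write would be `MRS` / `MSR` instructions, and the fences model the "memory" clobber mentioned above.

```rust
use core::sync::atomic::{compiler_fence, AtomicU32, Ordering};

// Host-side stand-in for the PRIMASK special register (assumption).
static FAKE_PRIMASK: AtomicU32 = AtomicU32::new(0);

/// Reads the simulated register.
pub fn read_primask() -> u32 {
    FAKE_PRIMASK.load(Ordering::Relaxed)
}

/// Writes the simulated register. The fences keep the compiler from moving
/// memory accesses across the write, so code meant to run inside the
/// critical section cannot drift outside of it.
pub fn write_primask(value: u32) {
    compiler_fence(Ordering::SeqCst);
    FAKE_PRIMASK.store(value, Ordering::Relaxed);
    compiler_fence(Ordering::SeqCst);
}

/// Runs `f` with interrupts masked (PRIMASK = 1), then restores the
/// previous mask so that nested critical sections behave correctly.
pub fn critical_section<R>(f: impl FnOnce() -> R) -> R {
    let saved = read_primask();
    write_primask(1);
    let result = f();
    write_primask(saved);
    result
}
```

Note that this is a read followed by a write of a value the program fully controls, not a read-modify-write of a register the compiler also manages, which matches the distinction drawn above.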
I left a couple of comments, where I wonder basically how some of these intrinsics (barriers, writes to system registers) interact with Rust's / LLVM memory model.
With respect to memory barriers, their specification just reads too much like "memory model"-speak in the context of atomics. If they just perform atomic operations, they should just state exactly which atomic operation they perform. If they differ from the existing atomic operations, they should state exactly how they differ, and what they allow that the current atomic operations do not. I am not opposed to adding aliases for already existing atomic operations, but if that's the case, I'd prefer these to just link to those operations appropriately so that we don't have to maintain memory model minutiae in this part of the std library. If these are new atomic operations that are currently not available in Rust, I'd suggest involving the memory model working group in this issue and their precise specification.
With respect to writes to system registers, I wonder if they can actually work correctly. On x86, for some registers, they do not - no idea about ARM but the RFC should convince me that this is the case. That is, that if I read a value from a system register, modify it, and write it back, that LLVM hasn't written something else to the system register while I was modifying it, and that my modification won't introduce undefined behavior. Sadly, I don't know what these system registers are, or which state they store, and explaining all of it might not be relevant for the whole RFC, but maybe somebody can answer in a comment, and a small summary can be included in the final RFC.
IIUC in the last discussion, CMSIS isn't an ARM C API, but a hardware abstraction layer that builds on top of ACLE. The consensus was that it belongs in a separate crate, and some people were against exposing multiple ways of doing the same thing in If there are some CMSIS APIs that cannot be implemented on top of ACLE, it would be nice to figure out which ones they are, and why they cannot be implemented. This will be easier once ACLE is implemented, and a cmsis crate is implemented on top. Based on the result, we should reconsider the decision of also exposing CMSIS from It might be, however, worth it to leave CMSIS out for now, and focus on implementing ACLE in std::arch first. We can always add the CMSIS intrinsics later, and stabilize them independently, and with the proof that the cmsis crate cannot be implemented on top of ACLE, then doing so should be pretty uncontroversial. |
CMSIS is a wild potpourri of a lot of different things; CMSIS-CORE is indeed a Cortex-M specific C API. ACLE is a set of compiler intrinsics and macros to allow easier access to various processing units (e.g. DSP, SIMD, Secure Elements...). The only thing CMSIS-CORE (and a few more building blocks like CMSIS-RTOS and CMSIS-DRIVER) and ACLE have in common is that they're really only relevant to the C/C++ programming languages and, in my opinion, do not make a whole lot of sense in the Rust world at all. |
What do you think would make more sense? |
@gnzlbg Well, we need the assembler intrinsics so it's certainly a good idea to check CMSIS-CORE and ACLE for the supplied ones and pick the important ones. Anything higher level and even the weird naming is irrelevant for Rust. After all CMSIS-CORE and ACLE are ARMs attempt to provide a framework for lowish-level developments and different compiler vendors; we already have those frameworks with |
The key point is that the goal here is to avoid But usually when a spec leaves something out (for example Also, ACLE deprecates some 'dsp' intrinsics on Cortex-A in favour of similar 'neon' intrinsics that CMSIS still supports. This is a clear "future silicon design direction" case. Also, there is nothing for Cortex-R on CMSIS and brand-new Cortex-R. In a sense ACLE dictates the "future direction", whereas CMSIS is just a wrapper around different compilers; do not forget that CMSIS was born before ACLE. |
I try to explain why Mainly a debugger insert a
Of course cache coherency issues might arise when writing a BKPT instruction, so some debugger In general on Cortex-R/A the debug software must carefully program certain debug events to prevent the Also, BKPT behaves differently on Prefetch Abort events depending on whether or not the processor is in debug mode, so the debugger controls under the hood the On Cortex-R/A a Prefetch Abort or Data Abort handler must check the value of the CP15 Fault Status Register too. So |
I have put up an implementation of the proposed API in the There's one (*) deviation from the RFC / ACLE: If a non-SIMD instruction is not available for the target architecture then To check that you can implement the CMSIS-CORE API with this I made a
I have left out all the DSP / SIMD32 intrinsics as they are not priority right Finally, to actually implement this I had to whitelist a few more target
The source code of Also API docs:
(*) OK, I lied. There's actually one more minor difference:
@gnzlbg does this seem like a reasonable implementation / API to you? |
Don't we have a plan to support the 'ANGEL' interface, which can be used in QEMU? I'm trying to write my own using asm, but if crate support is available it would be useful. jamesmunns/irr-embedded-2018#4 (comment)
Angel Interface: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0471c/Bgbjjgij.html
Update: Currently the crate |
I think this is fine, and we can always change that in a backwards compatible way to provide these as
In my experience what people do is write So I finally managed to review the whole crate. I think you can pretty much send it as a PR (the |
Hi @japaric thanks for writing this RFC, I am happy to see progress being made on this front 😄. My apologies for not replying sooner, I wanted to talk to my colleagues at Arm who write the ACLE first to see if they had anything to add or any recommendations. First some general notes
Some feedback specific to the RFC
|
dsp/simd32 are already implemented (I sent 4-5 PRs in the past). Following this RFC We need to reorganize the code a bit, so I agree to keep them in this RFC. |
I pretty much agree with everything you said, and I think there should be a crate built on stable Rust on top of The situation is pretty much the following: If we add ACLE intrinsics to So at this point I still think that the best is to just implement ACLE as close to the C spec as possible, even if it is horrible to use, to make the RFC process go as smoothly as possible, and just build a better API on top in a separate crate on crates.io that just doesn't have to go through the process and that we can evolve over time (hopefully ARM can contribute its experience to that crate). This might sound like we are choosing an imperfect solution for @japaric I am going to post some of @parched comments in the PR in stdsimd so that we can discuss how to "fix" them (or whether they can be fixed there). |
(As per this comment #63 (comment) I'm going to remove this from the edition milestone). |
rust-console could use a way to emit a |
Please note that SVC, depending on the service call handler/application, would need to return a `u32` or something a little heavier. Ideally, SVC could be declared with a customizable return type. |
The ARM AAPCS (ARM IHI 0042F) says the returning convention for a function is to use registers the following way:
So, to me it looks like the SVC intrinsic should be able to specify a return type, which, on the back end, for reasons of interoperability with C code, should follow these conventions. Note that even if, in theory and for optimization reasons, one could extend this to other registers, that would not work since only r0-r3, r12, lr, pc and xPSR are saved on the exception stack frame, except, maybe abusing r12, but that's probably too much trouble for little gain. |
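One way a "customizable return type" could look is a trait that rebuilds the return value from the registers the handler leaves on the exception frame. This is only a sketch: `SvcRegs`, `FromSvcRegs` and `decode_svc_return` are hypothetical names, not part of ACLE or any existing crate, and the 64-bit case assumes a little-endian target where `r0` carries the low word and `r1` the high word.

```rust
/// Result registers as left by an SVC handler (illustrative; only the two
/// registers needed for results of up to 64 bits are modelled).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SvcRegs {
    pub r0: u32,
    pub r1: u32,
}

/// Hypothetical trait: how a return type is reconstructed from registers.
pub trait FromSvcRegs {
    fn from_regs(regs: SvcRegs) -> Self;
}

impl FromSvcRegs for () {
    // A service call with no result ignores the registers.
    fn from_regs(_regs: SvcRegs) -> Self {}
}

impl FromSvcRegs for u32 {
    // Word-sized results come back in r0.
    fn from_regs(regs: SvcRegs) -> Self {
        regs.r0
    }
}

impl FromSvcRegs for u64 {
    // Double-word results use r0 and r1; low word in r0 is assumed
    // (little-endian target).
    fn from_regs(regs: SvcRegs) -> Self {
        (u64::from(regs.r1) << 32) | u64::from(regs.r0)
    }
}

/// Decodes the result of a service call into the caller-chosen type.
pub fn decode_svc_return<T: FromSvcRegs>(regs: SvcRegs) -> T {
    T::from_regs(regs)
}
```

An intrinsic generic over such a trait would let each service call pick its own return type while keeping the register-level convention in one place.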
@japaric Does this Pre-RFC still need some work to do? |
This might need to be updated with what is implemented on nightly. Also for an RFC, this goes into a lot of detail about each of the APIs. If one checks the |
We are unlikely to propose this RFC as it is so I'm going to close this PR. Let's continue the discussion on stabilization in #63. |
This is a pre-RFC of an RFC meant for rust-lang/rfcs. The reasons why I'm
putting it up as a PR in this repository are that (a) PRs are nice for reviewing
this kind of document and that (b) all ARM developers should be interested
in this and a good chunk of the people subscribed to this repo are ARM developers.
The rationale for this RFC is in #63 but the TL;DR is that we want to be able to
use instructions like `ISB`, `WFI` and `MRS BASEPRI` on stable w/o depending on an
external assembler like `arm-none-eabi-gcc`.
The RFC is still incomplete but I'd like to get feedback on the proposed API for
memory barriers, hints and system register access.
Some issues I see with ACLE as a Cortex-M person:
- No intrinsic for the BKPT instruction. There's an intrinsic for the DBG instruction but according to my tests that's treated as a NOP by the debugger, i.e. it can't be used as a hardware breakpoint.
- No intrinsics for the CPSID and CPSIE instructions. These are used to implement the `{disable,enable}_interrupt` API in the `cortex-m` crate. However, there are intrinsics to write to PRIMASK. I think it should be possible to implement `{disable,enable}_interrupt` by writing to PRIMASK, but I think that implementation would result in more instructions being emitted.
Rendered version
cc @rust-embedded/cortex-m @parched @gnzlbg @hannobraun @paoloteti @andre-richter @wizofe