cranelift: add support for the Mac aarch64 calling convention #2742
Cargo.lock
@@ -127,9 +127,9 @@ checksum = "23b62fc65de8e4e7f52534fb52b0f3ed04746ae267519eef2a83941e8085068b"

 [[package]]
 name = "async-trait"
-version = "0.1.42"
+version = "0.1.48"
This bumps more than target-lexicon. Could the other updates be split into a new PR?
Yeah, I just ran `cargo update`. I'm happy to do so, or to revert the updates unrelated to target-lexicon. But in general, patch bumps shouldn't bring in breaking changes (unfortunately that's not the case for this target-lexicon patch bump, which brings a breaking API change), so what advantages do you see in doing so?
For `0.x.y` versions, `y` bumps may still add features. Only breaking changes are not allowed.
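To make the versioning rule concrete, here is a small sketch of how Cargo's default caret requirement treats `0.x.y` versions: the left-most non-zero component must stay the same, so `0.11.y` patch bumps are picked up automatically while `0.12.0` is treated as a potentially breaking release. The `Version` type and `caret_matches` helper are hypothetical illustrations (no pre-release tags, no other requirement operators), and the version numbers are illustrative only.

```rust
/// A bare `major.minor.patch` version (hypothetical helper; no pre-release tags).
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
struct Version {
    major: u64,
    minor: u64,
    patch: u64,
}

impl Version {
    fn parse(s: &str) -> Option<Version> {
        let mut parts = s.split('.').map(|p| p.parse::<u64>().ok());
        let major = parts.next()??;
        let minor = parts.next()??;
        let patch = parts.next()??;
        if parts.next().is_some() {
            return None; // more than three components
        }
        Some(Version { major, minor, patch })
    }
}

/// Returns true if `candidate` satisfies the caret requirement `^req`:
/// it must not be older than `req`, and the left-most non-zero
/// component of `req` must be unchanged.
fn caret_matches(req: Version, candidate: Version) -> bool {
    if candidate < req {
        return false;
    }
    if req.major > 0 {
        candidate.major == req.major
    } else if req.minor > 0 {
        candidate.major == 0 && candidate.minor == req.minor
    } else {
        candidate.major == 0 && candidate.minor == 0 && candidate.patch == req.patch
    }
}

fn main() {
    let req = Version::parse("0.11.2").unwrap();
    // A patch bump inside 0.11 is considered compatible...
    println!("{}", caret_matches(req, Version::parse("0.11.3").unwrap())); // true
    // ...but 0.12.0 is not, so it is allowed to carry breaking changes.
    println!("{}", caret_matches(req, Version::parse("0.12.0").unwrap())); // false
}
```

This is why a breaking change (such as a new variant in an exhaustively matchable enum) calls for a `0.(x+1).0` release rather than a `0.x.(y+1)` one.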
I've opened bytecodealliance/target-lexicon#71 to discuss this, since I think adding the new enum variant is in fact a breaking API change (it may break users' match statements, since the enum isn't marked non-exhaustive), but it's subtle. In any case, I've reverted the other packages' updates.
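A minimal sketch of why the new variant matters: the enum below is a hypothetical stand-in loosely modeled on a calling-convention enum like target-lexicon's, not its actual definition. If an upstream enum is not `#[non_exhaustive]`, downstream code may match it exhaustively with no `_` arm, and that match stops compiling once a variant is added. With `#[non_exhaustive]`, other crates are forced to write a wildcard arm, so adding a variant stays non-breaking. (Inside the defining crate, the attribute has no such effect.)

```rust
/// Hypothetical calling-convention enum, marked non-exhaustive so that
/// adding a variant later is not a breaking change for other crates.
#[non_exhaustive]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum CallConv {
    SystemV,
    WindowsFastcall,
    AppleAarch64,
}

/// Hypothetical downstream query. The wildcard arm keeps this compiling
/// if the upstream crate adds another variant in a future release.
fn allows_sub_8_byte_stack_slots(cc: CallConv) -> bool {
    match cc {
        CallConv::AppleAarch64 => true,
        _ => false,
    }
}

fn main() {
    for cc in [CallConv::SystemV, CallConv::WindowsFastcall, CallConv::AppleAarch64] {
        println!("{:?}: {}", cc, allows_sub_8_byte_stack_slots(cc));
    }
}
```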
I've bumped target-lexicon to 0.12.0, which is the new breaking (0.x minor) release containing this new enum variant.
let size = std::cmp::max(size, 8);
// Align.

let size = if call_conv != isa::CallConv::AppleAarch64 {
When depending on new features of a dependency, could you also update the dependency requirement in `Cargo.toml`? That would be useful for users like cg_clif that only update a single crate at a time and keep the rest pinned using `Cargo.lock`.
// MacOS aarch64 allows stack slots with sizes less than 8
// bytes. They still need to be properly aligned on their
// natural data alignment, though.
size
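The diff fragments above only show pieces of the slot-size rule. Below is a self-contained sketch of the whole `if`/`else` as a function, assuming a simplified two-variant `CallConv` (hypothetical; Cranelift's real `isa::CallConv` has more variants) and omitting the alignment step that follows in the real code.

```rust
/// Simplified stand-in for Cranelift's `isa::CallConv` (hypothetical).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum CallConv {
    SystemV,
    AppleAarch64,
}

/// Size in bytes of the stack slot for an argument of `size` bytes.
fn stack_slot_size(call_conv: CallConv, size: u64) -> u64 {
    if call_conv != CallConv::AppleAarch64 {
        // Standard AArch64 ABI: every stack argument takes at least 8 bytes.
        std::cmp::max(size, 8)
    } else {
        // Apple arm64: slots may be smaller than 8 bytes; natural alignment
        // is still applied when the slot's offset is chosen.
        size
    }
}

fn main() {
    println!("{}", stack_slot_size(CallConv::SystemV, 4)); // 8
    println!("{}", stack_slot_size(CallConv::AppleAarch64, 4)); // 4
}
```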
Could the "fast" and "cold" calling conventions use this too? They are unstable anyway, and this would save stack space.
Maybe? It's a bit out of scope for this PR, so I'll try not to accidentally introduce new changes there. Plus, I am not sure these conventions are used; I seem to recall that "fast" implies the default calling convention in some cases, and if we're not careful that might mean subtly breaking other calling conventions. We should probably audit the "fast"/"cold" calling conventions at some point, and redesign them from the ground up.
+1 -- I actually think we should do something about the "fast" and "cold" conventions, but we should be a little more explicit in designing them, and probably give them better names. I'm not a huge fan of generic terms like "fast ABI" because the ambiguity is confusing -- both on the user/embedder side ("what guarantees does this have? how fast is fast? when can I use it?") and on the implementer side (do we just choose an array of features that lead to better speed, and add more as we think of them? IMHO an evolving ABI is a recipe for subtle bugs as we mutate invariants over time).
So I'd rather design a "fast internal Cranelift ABI v1", implement it, and then keep that as a first-class, well-defined ABI alongside the others, and retire names like "fast" and "cold", just for clarity's sake. But, that's a deeper discussion for another day, I think!
Thanks! Overall looks good and I'm very happy to see macOS/aarch64 gain support!
One small thing, in addition to the comment below: could you document the parts of the ABI that we don't implement yet (specifically the extension behavior, and i128 differences?), perhaps linking to an issue, so we don't lose track of them?
@@ -171,6 +171,9 @@ impl ABIMachineSpec for AArch64MachineDeps {
let has_baldrdash_tls = call_conv == isa::CallConv::Baldrdash2020;

// See AArch64 ABI (https://c9x.me/compile/bib/abi-arm64.pdf), sections 5.4.
// MacOS aarch64 is slightly different, see also
One thing I noticed here: "If the total number of bytes for stack-based arguments is not a multiple of 8 bytes, insert padding on the stack to maintain the 8-byte alignment requirements." However, below we round the final stack-argument area size up to a multiple of 16 bytes.
(Related to this, my understanding is that the trap-on-not-16-aligned-SP behavior of aarch64 is configurable with a mode bit as well; maybe this means Apple runs with only an 8-aligned stack?)
I think this is OK as it should be fine to reserve extra space at a callsite, but we should document that we diverge and why it's OK (and verify that it is!).
Yes, and it maintains our invariant that the stack is always allocated in 16-byte chunks, and is thus always aligned, which is nice. (Otherwise it would require more changes when generating prologues and epilogues.)
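A worked example of the divergence discussed in this thread, under the simplifying assumption that the stack-argument area is just the sum of the argument slots: for two `i32` stack arguments under the Apple convention, the quoted rule only requires padding to a multiple of 8 bytes, while rounding to 16 reserves 8 extra bytes at the end of the area. The `align_to` helper is a hypothetical stand-in, not Cranelift's own function.

```rust
/// Round `x` up to the next multiple of `align` (which must be a power of two).
fn align_to(x: u64, align: u64) -> u64 {
    debug_assert!(align.is_power_of_two());
    (x + align - 1) & !(align - 1)
}

fn main() {
    // Two i32 stack arguments at offsets +0 and +4 occupy 8 bytes.
    let arg_bytes: u64 = 4 + 4;
    // Padding required by the quoted Apple rule (multiple of 8).
    let apple_minimum = align_to(arg_bytes, 8);
    // Space actually reserved when allocating in 16-byte chunks.
    let reserved = align_to(arg_bytes, 16);
    println!("{} {}", apple_minimum, reserved); // 8 16
}
```

Since the callee addresses its arguments at fixed offsets from the incoming SP, the extra trailing bytes appear to be unused padding rather than a layout change, which matches the reasoning above that over-reserving at the callsite should be fine.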
This bumps target-lexicon and adds support for the AppleAarch64 calling convention. For WebAssembly support specifically, we only have to worry about the new stack slot convention: stack slots don't need to be at least 8 bytes, they can be as small as the data type's size. For instance, if we need stack slots for (i32, i32), they can be located at offsets (+0, +4). Note that they still need to be properly aligned for the data type they contain, though, so if we need stack slots for (i32, i64), we can't start the i64 slot at the +4 offset (it must start at the +8 offset).
Added one test that was failing on the Mac M1, as well as other tests stressing different yet similar situations.
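The offset examples in the description can be checked with a short sketch. It assumes every argument goes on the stack (no register assignment), that each argument's alignment equals its slot size (true for scalar integer types), and that the whole area is rounded up to 16 bytes; `CallConv` and `stack_arg_layout` are hypothetical names, not Cranelift's API.

```rust
/// Simplified stand-in for Cranelift's `isa::CallConv` (hypothetical).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum CallConv {
    SystemV,
    AppleAarch64,
}

/// Round `x` up to a multiple of `align` (a power of two).
fn align_to(x: u64, align: u64) -> u64 {
    debug_assert!(align.is_power_of_two());
    (x + align - 1) & !(align - 1)
}

/// Assigns stack offsets to arguments of the given byte sizes and returns
/// (offsets, total area size rounded up to 16 bytes).
fn stack_arg_layout(call_conv: CallConv, sizes: &[u64]) -> (Vec<u64>, u64) {
    let mut next = 0u64;
    let mut offsets = Vec::with_capacity(sizes.len());
    for &size in sizes {
        // Apple: slot is the natural size; otherwise at least 8 bytes.
        let slot = if call_conv == CallConv::AppleAarch64 { size } else { size.max(8) };
        // Align the slot on its own size (its natural alignment).
        next = align_to(next, slot);
        offsets.push(next);
        next += slot;
    }
    (offsets, align_to(next, 16))
}

fn main() {
    // (i32, i32) on Apple: offsets +0, +4.
    println!("{:?}", stack_arg_layout(CallConv::AppleAarch64, &[4, 4])); // ([0, 4], 16)
    // (i32, i64) on Apple: the i64 must start at +8, not +4.
    println!("{:?}", stack_arg_layout(CallConv::AppleAarch64, &[4, 8])); // ([0, 8], 16)
    // (i32, i32) with 8-byte minimum slots: offsets +0, +8.
    println!("{:?}", stack_arg_layout(CallConv::SystemV, &[4, 4])); // ([0, 8], 16)
}
```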
Subscribe to Label Action
cc @fitzgen, @peterhuene
This issue or pull request has been labeled: "cranelift", "cranelift:area:aarch64", "cranelift:area:machinst", "cranelift:area:x64", "cranelift:meta", "cranelift:module", "cranelift:wasm", "fuzzing", "wasmtime:api"
Thus the following users have been cc'd because of the following labels:
Fixes #2734.
(Note: more work will likely be needed to accommodate non-wasm uses: sign- or zero-extension of arguments narrower than 32 bits, and proper register passing for i128 values. I'm happy to try PRs and confirm here whether they work.)
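As a sketch of the extension work mentioned in the note: our understanding of Apple's arm64 documentation is that the caller is responsible for sign- or zero-extending arguments narrower than 32 bits, whereas the standard AArch64 ABI leaves the upper bits unspecified. The hypothetical helper below shows only the bit-level operation for an 8-bit argument; it is not Cranelift code and does not model how the extension would be emitted as machine instructions.

```rust
/// Widens an 8-bit argument to 32 bits: sign-extend if the parameter
/// is signed, zero-extend otherwise (hypothetical helper).
fn extend_i8_arg(raw: u8, signed: bool) -> u32 {
    if signed {
        // Reinterpret the bits as i8, sign-extend to i32, then view as u32.
        raw as i8 as i32 as u32
    } else {
        // Zero-extension: upper 24 bits are cleared.
        raw as u32
    }
}

fn main() {
    println!("{:#010x}", extend_i8_arg(0xFF, true)); // 0xffffffff
    println!("{:#010x}", extend_i8_arg(0xFF, false)); // 0x000000ff
}
```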