-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
x64: port select to ISLE #3682
x64: port select to ISLE #3682
Conversation
This is a complex lowering due to not only the tree-matching of the CLIF IR but also the sequences required by x64. Because of this the PR will need to go through a few more changes:
I'm putting this PR up as a draft to see if I can get some help with the above and early feedback on any suggested refactoring to the rules. |
Subscribe to Label Action
This issue or pull request has been labeled: "cranelift", "cranelift:area:x64", "isle"
Thus the following users have been cc'd because of the following labels:
To subscribe or unsubscribe from this label, edit the |
size: *size, | ||
consequent: consequent.clone(), | ||
alternative: alternative.clone(), | ||
dst: dst.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This overwrites dst
from above, I think -- it seems we want the first cmove to write a tmp, and alternative
in the second to be that tmp? So then we have the equivalent of cmove(cc1, a, cmove(cc2, b, alternative))
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking pretty good to me!
I think with some additional run tests we can start landing this incremental work and then finish up with the rest of the select
lowerings in follow up PRs (if you want; if you prefer doing everything in one big PR that's no problem too).
@cfallin, @fitzgen: I am finding some very strange behavior with ISLE and I'm not sure what is going on. Take, e.g., the following runtest:
If we run that (
The reason for this is that
In the code on this branch, both lowerings use the I then try to determine if in fact the "less than or equal" rule is firing in both cases. Using a well-placed But why is the rule not firing in the first case? After debugging through Any help? |
You can use (rule (lower (select val @ (value_type val_ty) ...))
...) |
@fitzgen, @cfallin: this should be ready for review. The I've also opened #3743 and #3744 as a result of this work. Work that is still TODO but could probably go in separate PRs is to translate the other side of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How confident are you in these lowerings? We talked before about getting the clif interpreter and fuzzer working for f32/f64 constants, fcmp
, and select
. Should we do that before landing these changes?
(let ((cons ValueRegs (put_in_regs consequent)) | ||
(alt ValueRegs (put_in_regs alternative)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that cmove_from_values
should take these ValueRegs
instead of Value
s. This lets us reuse it in places where we've already put the values into registers.
(decl cmove_from_values (Type CC ValueRegs ValueRegs) ConsumesFlags)
(rule (cmove_from_vlaue $I128 cc consequent alternative)
...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The downside to that is that it bloats the rules in lower.isle
; can we do this refactoring in a future PR or issue, e.g., the next time cmove is used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, that works for me.
;; TODO there must be some way to say that both of these | ||
;; instructions consume the flags together. | ||
(_ Unit (emit (MInst.Cmove size cc (RegMem.Reg (value_regs_get cons 0)) (value_regs_get alt 0) dst1)))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doing the emit
here is actually unsafe because nothing is guaranteeing that
- we just emitted the flags producer before this
- nothing else is emitted after the flags producer and before this flags consumer that clobbers the flags
What we want to do is return two ConsumesFlags
here that we can then pass into with_flags_2
along with a FlagsProducer
.
ISLE doesn't have tuple/fixed-size-arrays support, so we will have to define an extern type for this.
But! The other rule for cmove_from_values
for when it is a single-register value emits only one ConsumesFlags
, so I think we actually want an extern type that is either one or two ConsumesFlags
. I think we can just use a SmallVec<[ConsumesFlags; 2]>
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that we will want a with_flags_*
variant that takes the SmallVec<[ConsumesFlags; 2]>
as well, and it will probably have to be implemented in external Rust code so it can loop over each ConsumesFlags
in the small vec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds like a lot of new infrastructure to build and this PR may already have enough of that. The current lowering rules perform the same unsafe emission as this PR. Can't this be done afterwards?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current lowering rules for select
are written in Rust, so it isn't really comparable. It would be one thing if this emit
was near the emit
for the flags-producer instruction and the emit
for the other flags-consuming instruction, but it isn't. It isn't even clear to me, reading this code, that the instructions will always be emitted in the correct order!
It seems to me that the order of evaluation will be something like
- construct
FlagsProducer
(do not emit it) - call this constructor
- emit first cmove
- construct FlagsConsumer with the second cmove
- call a
with_flags
variant with the constructedFlagsProducer
andFlagsConsumer
- emit the producer
- emit the consumer
If that is actually how things end up evaluating, then we would get cmove; test/cmp; cmove
instead of test/cmp; cmove; cmove
.
It looks like we aren't testing select
with i128
s yet (since you have only ported floating point select
s so far). But this is exactly the kind bug I want to avoid with a dedicated with_flags_*
variant that takes SmallVec<[ConsumesFlags; 2]>
, as described above.
I really think we should fix this issue before landing this PR, or change this PR to not define and use cmove_from_values
and instead use plain cmove
(which will work for all the single-register values for select
that are are currently supported in ISLE).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The extensions to with_flags
, ProducesFlags
, and ConsumesFlags
look good!
I would vote "no." ISLE basically creates a conflict for every rebased commit requiring an ISLE rebuild and Git twiddling to ensure things are right before moving on. For those of use who try to keep a clean history with atomic commits, this additional burden makes keeping a long-running PR like this pretty painful. As you can see, I'm now just committing review patches in hopes that I can do one final rebase before merging. Also, the CLIF interpreter has not been extended for any of the other ISLE lowerings so doing it here is extra work that no other PRs have had to do. I don't mind doing this kind of thing but the "build up the infrastructure before merging" tasks are starting to pile up: bringing |
@abrown @fitzgen from my own perspective at least, I think it's fair to merge this at its current state and take some followup PRs/tasks -- I agree with @abrown that this pretty much maintains our status-quo in the sense that it passes existing tests. Deferring to @fitzgen on the detailed review discussions above of course... |
bcc3249
to
dff48f9
Compare
Not sure why aarch64 and s390x CLIF tests are failing--they shouldn't be running! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r+ from me at least, having been through this code while resolving some issues with @abrown last week, I think the basic approach is sound and this is more than enough improvement to merge at its current state, even if we still have further to go to complete the transition on this opcode. Also, as you mentioned above, the rebasing pain increases the longer we keep this pending, and there are other changes (e.g. implicit conversions) I want to land that should come after this.
It looks like there are some filetest failures, but I didn't look in-depth at what's causing them; perhaps some holes in the pattern coverage or some code that can't quite be deleted from the handwritten side yet?
It looks like @fitzgen is happy with the refactorings we did together but I'll defer to him to make sure he's happy with the final state of this.
Thanks again for all the patience here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, let's merge this and incrementally continue improving it in follow ups.
This change includes quite a few interlocking parts, required mainly by the current x64 conventions in ISLE: - it adds a way to emit a `cmove` with multiple OR-ing conditions; because x64 ISLE cannot currently safely emit a comparison followed by several jumps, this adds `MachInst::CmoveOr` and `MachInst::XmmCmoveOr` macro instructions. Unfortunately, these macro instructions hide the multi-instruction sequence in `lower.isle` - to properly keep track of what instructions consume and produce flags, @cfallin added a way to pass around variants of `ConsumesFlags` and `ProducesFlags`--these changes affect all backends - then, to lower the `fcmp + select` CLIF, this change adds several `cmove*_from_values` helpers that perform all of the awkward conversions between `Value`, `ValueReg`, `Reg`, and `Gpr/Xmm`; one upside is that now these lowerings have much-improved documentation explaining why the various `FloatCC` and `CC` choices are made the the way they are. Co-authored-by: Chris Fallin <chris@cfallin.org>
The `fpcmp` helper in the x64 backend uses `put_in_xmm_mem` for one of its operands, which allows the compiler to merge a load with the compare instruction (`ucomiss` or `ucomisd`). Unfortunately, as we saw in bytecodealliance#2576 for the integer-compare case, this does not work with our lowering algorithm because compares can be lowered more than once (unlike all other instructions) to reproduce the flags where needed. Merging a load into an op that executes more than once is invalid in general (the two loads may observe different values, which violates the original program semantics because there was only one load originally). This does not result in a miscompilation, but instead will cause a panic at regalloc time because the register that should have been defined by the separate load is never written (the load is never emitted separately). I think this (very subtle, easy to miss) condition was unfortunately not ported over when we moved the logic in bytecodealliance#3682. The existing fcmp-of-load test in `cmp-mem-bug` (from bytecodealliance#2576) does not seem to trigger it, for a reason I haven't fully deduced. I just added the verbatim function body (happens to come from `clang.wasm`) that triggers the bug as a test. Discovered while bringing up regalloc2 support. It's pretty unlikely to hit by chance, which is why I think none of our fuzzing has hit it yet.
The `fpcmp` helper in the x64 backend uses `put_in_xmm_mem` for one of its operands, which allows the compiler to merge a load with the compare instruction (`ucomiss` or `ucomisd`). Unfortunately, as we saw in #2576 for the integer-compare case, this does not work with our lowering algorithm because compares can be lowered more than once (unlike all other instructions) to reproduce the flags where needed. Merging a load into an op that executes more than once is invalid in general (the two loads may observe different values, which violates the original program semantics because there was only one load originally). This does not result in a miscompilation, but instead will cause a panic at regalloc time because the register that should have been defined by the separate load is never written (the load is never emitted separately). I think this (very subtle, easy to miss) condition was unfortunately not ported over when we moved the logic in #3682. The existing fcmp-of-load test in `cmp-mem-bug` (from #2576) does not seem to trigger it, for a reason I haven't fully deduced. I just added the verbatim function body (happens to come from `clang.wasm`) that triggers the bug as a test. Discovered while bringing up regalloc2 support. It's pretty unlikely to hit by chance, which is why I think none of our fuzzing has hit it yet.
* x64: port `select` using an FP comparison to ISLE This change includes quite a few interlocking parts, required mainly by the current x64 conventions in ISLE: - it adds a way to emit a `cmove` with multiple OR-ing conditions; because x64 ISLE cannot currently safely emit a comparison followed by several jumps, this adds `MachInst::CmoveOr` and `MachInst::XmmCmoveOr` macro instructions. Unfortunately, these macro instructions hide the multi-instruction sequence in `lower.isle` - to properly keep track of what instructions consume and produce flags, @cfallin added a way to pass around variants of `ConsumesFlags` and `ProducesFlags`--these changes affect all backends - then, to lower the `fcmp + select` CLIF, this change adds several `cmove*_from_values` helpers that perform all of the awkward conversions between `Value`, `ValueReg`, `Reg`, and `Gpr/Xmm`; one upside is that now these lowerings have much-improved documentation explaining why the various `FloatCC` and `CC` choices are made the the way they are. Co-authored-by: Chris Fallin <chris@cfallin.org>
The `fpcmp` helper in the x64 backend uses `put_in_xmm_mem` for one of its operands, which allows the compiler to merge a load with the compare instruction (`ucomiss` or `ucomisd`). Unfortunately, as we saw in bytecodealliance#2576 for the integer-compare case, this does not work with our lowering algorithm because compares can be lowered more than once (unlike all other instructions) to reproduce the flags where needed. Merging a load into an op that executes more than once is invalid in general (the two loads may observe different values, which violates the original program semantics because there was only one load originally). This does not result in a miscompilation, but instead will cause a panic at regalloc time because the register that should have been defined by the separate load is never written (the load is never emitted separately). I think this (very subtle, easy to miss) condition was unfortunately not ported over when we moved the logic in bytecodealliance#3682. The existing fcmp-of-load test in `cmp-mem-bug` (from bytecodealliance#2576) does not seem to trigger it, for a reason I haven't fully deduced. I just added the verbatim function body (happens to come from `clang.wasm`) that triggers the bug as a test. Discovered while bringing up regalloc2 support. It's pretty unlikely to hit by chance, which is why I think none of our fuzzing has hit it yet.
The `fpcmp` helper in the x64 backend uses `put_in_xmm_mem` for one of its operands, which allows the compiler to merge a load with the compare instruction (`ucomiss` or `ucomisd`). Unfortunately, as we saw in #2576 for the integer-compare case, this does not work with our lowering algorithm because compares can be lowered more than once (unlike all other instructions) to reproduce the flags where needed. Merging a load into an op that executes more than once is invalid in general (the two loads may observe different values, which violates the original program semantics because there was only one load originally). This does not result in a miscompilation, but instead will cause a panic at regalloc time because the register that should have been defined by the separate load is never written (the load is never emitted separately). I think this (very subtle, easy to miss) condition was unfortunately not ported over when we moved the logic in #3682. The existing fcmp-of-load test in `cmp-mem-bug` (from #2576) does not seem to trigger it, for a reason I haven't fully deduced. I just added the verbatim function body (happens to come from `clang.wasm`) that triggers the bug as a test. Discovered while bringing up regalloc2 support. It's pretty unlikely to hit by chance, which is why I think none of our fuzzing has hit it yet.
This change ports CLIF's
select
instruction to ISLE for x64.