-
Notifications
You must be signed in to change notification settings - Fork 2.7k
Implement the remaining BMI1/2 intrinsic #21480
Conversation
@@ -35,7 +35,7 @@ public abstract class X64 | |||
/// BEXTR r64a, reg/m64, r64b | |||
/// This intrinisc is only available on 64-bit processes | |||
/// </summary> | |||
public static ulong BitFieldExtract(ulong value, byte start, byte length) => BitFieldExtract(value, start, length); | |||
public static ulong BitFieldExtract(ulong value, byte start, byte length) => BitFieldExtract(value, (ushort)(start | (length << 8))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implement the 3-arg version in managed code, which significantly simplified containment codegen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does codegen look for this intrinsic? Such a combination of small integer type might result in such messy codegen that the intrinsic would be practically useless for non-constant start and length.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are some inefficient codegen from small integer
IN0001: 000009 mov rcx, 0x1F272F13200
IN0002: 000013 mov rcx, gword ptr [rcx]
IN0003: 000016 call TestFramework:BeginScenario(ref)
IN0004: 00001B mov rcx, 0x7FF905B8EFB8
IN0005: 000025 call CORINFO_HELP_NEWSFAST
IN0006: 00002A mov rdi, rax
IN0007: 00002D mov rcx, rdi
IN0008: 000030 call ScalarTernOpTest__BitFieldExtractUInt64:.ctor():this
IN0009: 000035 mov rdx, qword ptr [rdi+8]
IN000a: 000039 mov r8, rdx
IN000b: 00003C movzx r9, byte ptr [rdi+16]
IN000c: 000041 mov ecx, r9d
IN000d: 000044 movzx rax, byte ptr [rdi+17]
IN000e: 000048 mov r10d, eax
IN000f: 00004B shl r10d, 8
IN0010: 00004F or ecx, r10d
IN0011: 000052 movzx rcx, cx
IN0012: 000055 vbextr r8, rcx, r8
IN0013: 00005A mov qword ptr [V03+0x20 rsp+20H], r8
IN0014: 00005F mov r8, 0x1F272F13200
IN0015: 000069 mov r8, gword ptr [r8]
IN0016: 00006C mov gword ptr [V03+0x28 rsp+28H], r8
IN0017: 000071 mov r8d, r9d
IN0018: 000074 mov r9d, eax
IN0019: 000077 mov rcx, rsi
IN001a: 00007A call ScalarTernOpTest__BitFieldExtractUInt64:ValidateResult(long,ubyte,ubyte,long,ref):this
IN001b: 00007F nop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it doesn't look too bad. I'm not sure what's up with IN000a and IN000c but they look like artifacts of a rather fancy test setup, you won't get those if you just use local variables. The only real problem seems to be the useless IN0011, that's a consequence of the unfortunately typed control parameter. I think a properly implemented version of JIT's optNarrowTree
should be able to get rid of that extra cast.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only real problem seems to be the useless IN0011,
Right, I think so. The problem is that JIT does not know the following consumer is a truncation and should eliminate the promotion. This issue is not specific to this intrinsic, we have detected it from other ones (e.g., Sse41.Insert
).
Seems related to https://github.com/dotnet/coreclr/issues/13210
Note, JIT prints the disasm incorrectly, that actually should be
0: 41 c1 e2 08 shl r10d,0x8
4: 41 0b ca or ecx,r10d
7: 0f b7 c9 movzx ecx,cx
a: c4 42 f0 f7 c0 bextr r8,r8,rcx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note, JIT prints the disasm incorrectly
Could you call this one out on https://github.com/dotnet/coreclr/issues/21441?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix the new instruction print in this PR, logged movzx
in #21441.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will fix the new instruction print in this PR,
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that JIT does not know the following consumer is a truncation and should eliminate the promotion
I wouldn't expect the JIT to start understanding such intrinsic details. Instead, it should recognize that the expression produces a value where the upper 16 bits are already 0. Though of course, it would have been easier if this problem wasn't introduced in the first place by using the wrong parameter type.
Seems related to #13210
Sort of, but that's more specific and the fix I have for that does not involve optNarrowTree
.
@tannergooding Do you where is the CoreFX update PR? I did find it... |
#21374 looks to be the last PR that updated CoreFX, which looks to be from the same day we merged the API changes. The next one should contain the fixups, but it looks like the build might be blocked. |
@dotnet-bot test this please |
@dotnet-bot test Ubuntu x64 Checked CoreFX Tests |
This PR is ready for final review. @CarolEidt @tannergooding @mikedn PTAL |
You really should fix the type of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall, with one suggested assert and some comments.
@CarolEidt Thank you for the comments. Addressed feedback. |
@tannergooding ping? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM with on more assert request and one remaining typo to fix.
@@ -931,6 +931,8 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX | |||
void genHWIntrinsic_R_RM(GenTreeHWIntrinsic* node, instruction ins, emitAttr attr); | |||
void genHWIntrinsic_R_RM_I(GenTreeHWIntrinsic* node, instruction ins, int8_t ival); | |||
void genHWIntrinsic_R_R_RM(GenTreeHWIntrinsic* node, instruction ins, emitAttr attr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might it be worthwhile to just get rid of this one and have the callsites updated to call the new overload directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The original overload is called by several places, that would duplicate some code without the wrapper overload. I will refactor some HW intrinsic code later, that may be a good time to reconsider this.
@@ -2835,7 +2835,9 @@ void Lowering::ContainCheckHWIntrinsic(GenTreeHWIntrinsic* node) | |||
{ | |||
MakeSrcContained(node, op2); | |||
} | |||
else if (isCommutative && IsContainableHWIntrinsicOp(node, op1, &supportsRegOptional)) | |||
else if ((isCommutative || (intrinsicId == NI_BMI2_MultiplyNoFlags) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we just mark MultiplyNoFlags
as commutative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MultiplyNoFlags
has 2-arg and 3-arg overloads both. Adding the commutative flag for MultiplyNoFlags
will break some existing assumptions (e.g., if (numArgs == 3) assert(!isCommutative);
). I think that is not worthwhile for this very special one.
MakeSrcContained(node, op1); | ||
// MultiplyNoFlags is a Commutative operation, so swap the first two operands here | ||
// to make the containment checks in codegen significantly simpler | ||
*(originalArgList->pCurrent()) = op2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is similar to the FMA case and we could just mark either op1 or op2 as contained without switching here. We already have the logic in codegen to extract the appropriate registers. Also doing the swap for op1/op2 depending on which is contained should also be trivial to support there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are reusing the existing code for 2-arg overload that may swap op1 and op2 in lowering. So, also swapping op1/op2 here for 3-arg version will unify the following code.
@dotnet-bot test this |
@tannergooding Does this PR look good to you? CI seems stuck again, but it was all green... |
Again, you should fix |
@mikedn Sorry to miss your comment. As I said above, the codegen issue is not specific to |
This is not what was proposed/reviewed/approved by the framework API review team. If we want it changed and there is a strong/sufficient reason to have it changed, we need a separate API proposal requesting it. It will then need to be fast-tracked through the review process after everyone gets back from holiday. However, I'm not sure that the API review team will opt for changing the API this late (cc. @terrajobst). It was already fairly extensively discussed whether we would expose Having |
The "reasonable" thing is not always obvious. It is generally the case that people have different views on the "correct" approach to designing an API or solving a problem. And sometimes that view is skewed by a persons experience or inexperience in a particular area. Not everyone is an expert on the runtime or on low-level code in general, but each person brings something unique and useful to the table, including our external contributors.
Yes, this looks to be a case that we missed and we should aim for consistency. We might not have found it if it weren't for people looking at the surface area and pointing it out. The API surface area for the HWIntrinsics isn't exactly small and it has undergone a few revisions as people have used the previews and provided feedback.
I will try and log an issue around the concern you've raised here. However, I am not you and may not be able to 100% capture your viewpoint or correctly relay the comments your are making. I can only capture what I've interpreted your concern to be, and link back to the conversation here. |
Again, if you do not know what you're doing then the best thing to do is not to wade into uncharted territories by inventing things that do not need to be invented. But no, let's deviate from typical .NET API conventions (most use |
@mikedn Thanks for pointing this out.
I went over all the scalar ISA APIs, almost all the APIs operate over |
We are much later in the release cycle now and we did follow up with all of the previous changes to get them affirmed. This change would be no different. I dont think we will move anywhere actionable on this until after the holidays are over. It would be beneficial, in the meantime, if we could get an issue opened tracking this. It might also be beneficial to list the scalar intrinsics, what they currently take as input arguments, and what the actual instruction takes (including limitations on the input range). -- I dont believe we have that many that operate directly on scalar values, so the list hopefully wont be too large |
Hmm, that is unfortunate to be late in the release cycle. Okay, in my opinion, let's keep the current design, then discuss this topic in the next API review if possible. Although the current BMI APIs has a bit inconsistency, other solutions also have disadvantages (e.g., make scalar APIs inconsistent from SIMD, or make some APIs less intuitive, etc.). This is a really small issue (its drawback can be solved by optimizations), it is not worth to block the whole feature release. Make sense? |
I agree. Any future API modifications shouldn't block this PR from going through.
This shouldn't put the feature at risk, and we should still have plenty of time to get the inconsistency reviewed and fixed via the appropriate process. I would not be in favor of making a decision here while everyone is on holiday, especially since this will likely require input from more people and some more context on the shape/limitations on the other APIs we've exposed. |
Rebased to try CI testing again. |
@dotnet-bot test Windows_NT x64 Checked Innerloop Build and Test |
@dotnet/dnceng, are there any known issues with the Arm/OSX machines right now? Seems the jobs keep losing connecting mid run. |
And two Windows_NT x64 tests triggered but not started (too long queue?). |
@tannergooding the power spikes have turned off all physical machines. There is an ongoing outage that is currently being worked on. Updates have been sent to the partners alias and we will continue to update that email thread when we have more. |
Most jobs get green, @tannergooding can we ignore the CI stuck and the unrelated failure in "Windows_NT x64 Checked CoreFX Tests"? |
test Ubuntu arm Cross Checked crossgen_comparison Build and Test |
I would like to see the Ubuntu arm jobs passing, as they've started working on the other PRs now. The |
Thanks! |
* Add tests for BMI1/2 intrinsic * Implement the remaining BMI1/2 intrinsic * Fix emitDispIns for BMI instruction Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
* Add tests for BMI1/2 intrinsic * Implement the remaining BMI1/2 intrinsic * Fix emitDispIns for BMI instruction Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
* Add tests for BMI1/2 intrinsic * Implement the remaining BMI1/2 intrinsic * Fix emitDispIns for BMI instruction Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
* Add tests for BMI1/2 intrinsic * Implement the remaining BMI1/2 intrinsic * Fix emitDispIns for BMI instruction Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
* Add tests for BMI1/2 intrinsic * Implement the remaining BMI1/2 intrinsic * Fix emitDispIns for BMI instruction Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
* Add tests for BMI1/2 intrinsic * Implement the remaining BMI1/2 intrinsic * Fix emitDispIns for BMI instruction Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
* Add tests for BMI1/2 intrinsic * Implement the remaining BMI1/2 intrinsic * Fix emitDispIns for BMI instruction Commit migrated from dotnet/coreclr@7ac4a46
This PR implements the remaining BMI1/2 intrinsic.
After this PR, BMI1 and BMI2 get fully implemented and all the intrinsic APIs proposed in https://github.com/dotnet/corefx/issues/22940 are enabled for .NET Core 3.0.
Close https://github.com/dotnet/coreclr/issues/18712 and close https://github.com/dotnet/corefx/issues/22940
@tannergooding @CarolEidt