Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Implement the remaining BMI1/2 intrinsic #21480

Merged
merged 3 commits into from
Dec 21, 2018
Merged

Implement the remaining BMI1/2 intrinsic #21480

merged 3 commits into from
Dec 21, 2018

Conversation

fiigii
Copy link

@fiigii fiigii commented Dec 11, 2018

This PR implements the remaining BMI1/2 intrinsic.

After this PR, BMI1 and BMI2 get fully implemented and all the intrinsic APIs proposed in https://github.com/dotnet/corefx/issues/22940 are enabled for .NET Core 3.0.

Close https://github.com/dotnet/coreclr/issues/18712 and close https://github.com/dotnet/corefx/issues/22940

@tannergooding @CarolEidt

@@ -35,7 +35,7 @@ public abstract class X64
/// BEXTR r64a, reg/m64, r64b
/// This intrinisc is only available on 64-bit processes
/// </summary>
public static ulong BitFieldExtract(ulong value, byte start, byte length) => BitFieldExtract(value, start, length);
public static ulong BitFieldExtract(ulong value, byte start, byte length) => BitFieldExtract(value, (ushort)(start | (length << 8)));
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implement the 3-arg version in managed code, which significantly simplified containment codegen.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does codegen look for this intrinsic? Such a combination of small integer type might result in such messy codegen that the intrinsic would be practically useless for non-constant start and length.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there are some inefficient codegen from small integer

IN0001: 000009 mov      rcx, 0x1F272F13200
IN0002: 000013 mov      rcx, gword ptr [rcx]
IN0003: 000016 call     TestFramework:BeginScenario(ref)
IN0004: 00001B mov      rcx, 0x7FF905B8EFB8
IN0005: 000025 call     CORINFO_HELP_NEWSFAST
IN0006: 00002A mov      rdi, rax
IN0007: 00002D mov      rcx, rdi
IN0008: 000030 call     ScalarTernOpTest__BitFieldExtractUInt64:.ctor():this
IN0009: 000035 mov      rdx, qword ptr [rdi+8]
IN000a: 000039 mov      r8, rdx
IN000b: 00003C movzx    r9, byte  ptr [rdi+16]
IN000c: 000041 mov      ecx, r9d
IN000d: 000044 movzx    rax, byte  ptr [rdi+17]
IN000e: 000048 mov      r10d, eax
IN000f: 00004B shl      r10d, 8
IN0010: 00004F or       ecx, r10d
IN0011: 000052 movzx    rcx, cx
IN0012: 000055 vbextr   r8, rcx, r8
IN0013: 00005A mov      qword ptr [V03+0x20 rsp+20H], r8
IN0014: 00005F mov      r8, 0x1F272F13200
IN0015: 000069 mov      r8, gword ptr [r8]
IN0016: 00006C mov      gword ptr [V03+0x28 rsp+28H], r8
IN0017: 000071 mov      r8d, r9d
IN0018: 000074 mov      r9d, eax
IN0019: 000077 mov      rcx, rsi
IN001a: 00007A call     ScalarTernOpTest__BitFieldExtractUInt64:ValidateResult(long,ubyte,ubyte,long,ref):this
IN001b: 00007F nop  

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it doesn't look too bad. I'm not sure what's up with IN000a and IN000c but they look like artifacts of a rather fancy test setup, you won't get those if you just use local variables. The only real problem seems to be the useless IN0011, that's a consequence of the unfortunately typed control parameter. I think a properly implemented version of JIT's optNarrowTree should be able to get rid of that extra cast.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only real problem seems to be the useless IN0011,

Right, I think so. The problem is that JIT does not know the following consumer is a truncation and should eliminate the promotion. This issue is not specific to this intrinsic, we have detected it from other ones (e.g., Sse41.Insert).

Seems related to https://github.com/dotnet/coreclr/issues/13210

Note, JIT prints the disasm incorrectly, that actually should be

0:  41 c1 e2 08             shl    r10d,0x8
4:  41 0b ca                or     ecx,r10d
7:  0f b7 c9                movzx  ecx,cx
a:  c4 42 f0 f7 c0          bextr  r8,r8,rcx

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note, JIT prints the disasm incorrectly

Could you call this one out on https://github.com/dotnet/coreclr/issues/21441?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix the new instruction print in this PR, logged movzx in #21441.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will fix the new instruction print in this PR,

Done.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that JIT does not know the following consumer is a truncation and should eliminate the promotion

I wouldn't expect the JIT to start understanding such intrinsic details. Instead, it should recognize that the expression produces a value where the upper 16 bits are already 0. Though of course, it would have been easier if this problem wasn't introduced in the first place by using the wrong parameter type.

Seems related to #13210

Sort of, but that's more specific and the fix I have for that does not involve optNarrowTree.

@fiigii
Copy link
Author

fiigii commented Dec 11, 2018

@tannergooding Do you where is the CoreFX update PR? I did find it...

@tannergooding
Copy link
Member

Do you where is the CoreFX update PR? I did find it...

#21374 looks to be the last PR that updated CoreFX, which looks to be from the same day we merged the API changes. The next one should contain the fixups, but it looks like the build might be blocked.

@fiigii
Copy link
Author

fiigii commented Dec 14, 2018

@dotnet-bot test this please

@fiigii fiigii changed the title [WIP] Implement the remaining BMI1/2 intrinsic Implement the remaining BMI1/2 intrinsic Dec 14, 2018
@fiigii
Copy link
Author

fiigii commented Dec 14, 2018

@dotnet-bot test Ubuntu x64 Checked CoreFX Tests
@dotnet-bot test Windows_NT x86 Release Innerloop Build and Test

@fiigii
Copy link
Author

fiigii commented Dec 14, 2018

This PR is ready for final review. @CarolEidt @tannergooding @mikedn PTAL

@mikedn
Copy link

mikedn commented Dec 14, 2018

You really should fix the type of BitFieldExtract's start/length/control parameters. There's no reason for those to be byte/ushort.

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, with one suggested assert and some comments.

src/jit/emitxarch.cpp Show resolved Hide resolved
src/jit/gentree.cpp Outdated Show resolved Hide resolved
src/jit/gentree.cpp Outdated Show resolved Hide resolved
src/jit/hwintrinsiccodegenxarch.cpp Show resolved Hide resolved
src/jit/hwintrinsiccodegenxarch.cpp Show resolved Hide resolved
src/jit/hwintrinsiclistxarch.h Outdated Show resolved Hide resolved
src/jit/hwintrinsiccodegenxarch.cpp Show resolved Hide resolved
src/jit/lowerxarch.cpp Show resolved Hide resolved
@fiigii
Copy link
Author

fiigii commented Dec 15, 2018

@CarolEidt Thank you for the comments. Addressed feedback.

@fiigii
Copy link
Author

fiigii commented Dec 17, 2018

@tannergooding ping?

Copy link

@CarolEidt CarolEidt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with on more assert request and one remaining typo to fix.

src/jit/hwintrinsiccodegenxarch.cpp Show resolved Hide resolved
@@ -931,6 +931,8 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
void genHWIntrinsic_R_RM(GenTreeHWIntrinsic* node, instruction ins, emitAttr attr);
void genHWIntrinsic_R_RM_I(GenTreeHWIntrinsic* node, instruction ins, int8_t ival);
void genHWIntrinsic_R_R_RM(GenTreeHWIntrinsic* node, instruction ins, emitAttr attr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might it be worthwhile to just get rid of this one and have the callsites updated to call the new overload directly?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original overload is called by several places, that would duplicate some code without the wrapper overload. I will refactor some HW intrinsic code later, that may be a good time to reconsider this.

@@ -2835,7 +2835,9 @@ void Lowering::ContainCheckHWIntrinsic(GenTreeHWIntrinsic* node)
{
MakeSrcContained(node, op2);
}
else if (isCommutative && IsContainableHWIntrinsicOp(node, op1, &supportsRegOptional))
else if ((isCommutative || (intrinsicId == NI_BMI2_MultiplyNoFlags) ||
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we just mark MultiplyNoFlags as commutative?

Copy link
Author

@fiigii fiigii Dec 18, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MultiplyNoFlags has 2-arg and 3-arg overloads both. Adding the commutative flag for MultiplyNoFlags will break some existing assumptions (e.g., if (numArgs == 3) assert(!isCommutative);). I think that is not worthwhile for this very special one.

MakeSrcContained(node, op1);
// MultiplyNoFlags is a Commutative operation, so swap the first two operands here
// to make the containment checks in codegen significantly simpler
*(originalArgList->pCurrent()) = op2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is similar to the FMA case and we could just mark either op1 or op2 as contained without switching here. We already have the logic in codegen to extract the appropriate registers. Also doing the swap for op1/op2 depending on which is contained should also be trivial to support there.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are reusing the existing code for 2-arg overload that may swap op1 and op2 in lowering. So, also swapping op1/op2 here for 3-arg version will unify the following code.

@fiigii
Copy link
Author

fiigii commented Dec 20, 2018

@dotnet-bot test this

@fiigii
Copy link
Author

fiigii commented Dec 20, 2018

@tannergooding Does this PR look good to you? CI seems stuck again, but it was all green...

@mikedn
Copy link

mikedn commented Dec 20, 2018

Again, you should fix bextr's control parameter, it should not be ushort, it should be int.

@fiigii
Copy link
Author

fiigii commented Dec 20, 2018

@mikedn Sorry to miss your comment. As I said above, the codegen issue is not specific to bextr.
For hardware intrinsic, we have decided to provide more "intuitive" APIs than C++, for example, we provide Sse41.Insert(Vector128<sbyte> value, sbyte data, byte index) instead of Sse41.Insert(Vector128<sbyte> value, int data, byte index). So, we have the consistency on bextr API as well. And I think we can address the movzx problem by optimizations (e.g., optNarrowTree that you suggested) for all the small integer APIs, not only bextr.

@tannergooding
Copy link
Member

Again, you should fix bextr's control parameter, it should not be ushort, it should be int.

This is not what was proposed/reviewed/approved by the framework API review team. If we want it changed and there is a strong/sufficient reason to have it changed, we need a separate API proposal requesting it. It will then need to be fast-tracked through the review process after everyone gets back from holiday.

However, I'm not sure that the API review team will opt for changing the API this late (cc. @terrajobst). It was already fairly extensively discussed whether we would expose int like the native APIs, or if we should expose byte (which better matches what the instructions actually support) and it was previously decided that we should try to lead users toward the pit of success by typing the immediate parameters appropriately (that is, if they only support inputs from 0-255, or less, we should only allow them to give inputs of 0-255).

Having byte/sbyte/short/ushort parameters is also not completely unusual and I would expect that it would be more broadly beneficial to ensure these cases generate good code.

@tannergooding
Copy link
Member

But if you do it and if you aren't capable of doing it properly then at least have some common sense and do the reasonable thing

The "reasonable" thing is not always obvious. It is generally the case that people have different views on the "correct" approach to designing an API or solving a problem. And sometimes that view is skewed by a persons experience or inexperience in a particular area. Not everyone is an expert on the runtime or on low-level code in general, but each person brings something unique and useful to the table, including our external contributors.

Even if it was, the same "excuse" would apply to BZHI as well. Guess what, that has an uint parameter.

Yes, this looks to be a case that we missed and we should aim for consistency. We might not have found it if it weren't for people looking at the surface area and pointing it out. The API surface area for the HWIntrinsics isn't exactly small and it has undergone a few revisions as people have used the previews and provided feedback.

I saw a problem and commented on it. Feel free to take it to the API review team.

I will try and log an issue around the concern you've raised here. However, I am not you and may not be able to 100% capture your viewpoint or correctly relay the comments your are making. I can only capture what I've interpreted your concern to be, and link back to the conversation here.

@mikedn
Copy link

mikedn commented Dec 20, 2018

Not everyone is an expert on the runtime or on low-level code in general

Again, if you do not know what you're doing then the best thing to do is not to wade into uncharted territories by inventing things that do not need to be invented. But no, let's deviate from typical .NET API conventions (most use int unless there's a good reason to other types), let's deviate from the native intrinsic design (because hey, we probably know better than those guys), let's deviate form the actual hw instruction (because staying as close as possible to the hw instruction wasn't a design mantra to begin with).

@fiigii
Copy link
Author

fiigii commented Dec 20, 2018

the same "excuse" would apply to BZHI as well. Guess what, that has an uint parameter.

@mikedn Thanks for pointing this out.

I will try and log an issue around the concern you've raised here.

I went over all the scalar ISA APIs, almost all the APIs operate over ulong/uint(except CRC32 that indeed needs small integers). Perhaps, we can keep them consistent to only use ulong/uint instead of the opposite way?
And we can treat scalar APIs separate from SIMD ones, which for example still let Insert take scalar parameters with its base-type.
@tannergooding @danmosemsft If just change this one API BitFieldExtract for API consistency, can we skip the formal API review? We already did the similar work just after 2.1 release.

@tannergooding
Copy link
Member

If just change this one API BitFieldExtract for API consistency, can we skip the formal API review? We already did the similar work just after 2.1 release.

We are much later in the release cycle now and we did follow up with all of the previous changes to get them affirmed. This change would be no different.

I dont think we will move anywhere actionable on this until after the holidays are over.

It would be beneficial, in the meantime, if we could get an issue opened tracking this. It might also be beneficial to list the scalar intrinsics, what they currently take as input arguments, and what the actual instruction takes (including limitations on the input range). -- I dont believe we have that many that operate directly on scalar values, so the list hopefully wont be too large

@fiigii
Copy link
Author

fiigii commented Dec 20, 2018

We are much later in the release cycle now and we did follow up with all of the previous changes to get them affirmed. This change would be no different.

I dont think we will move anywhere actionable on this until after the holidays are over.

Hmm, that is unfortunate to be late in the release cycle. Okay, in my opinion, let's keep the current design, then discuss this topic in the next API review if possible. Although the current BMI APIs has a bit inconsistency, other solutions also have disadvantages (e.g., make scalar APIs inconsistent from SIMD, or make some APIs less intuitive, etc.). This is a really small issue (its drawback can be solved by optimizations), it is not worth to block the whole feature release. Make sense?

@tannergooding
Copy link
Member

Okay, in my opinion, let's keep the current design, then discuss this topic in the next API review if possible

I agree. Any future API modifications shouldn't block this PR from going through.

Although the current BMI APIs has a bit inconsistency, other solutions also have disadvantages (e.g., make scalar APIs inconsistent from SIMD, or make some APIs less intuitive, etc.). This is a really small issue (its drawback can be solved by optimizations), it is not worth to block the whole feature release. Make sense?

This shouldn't put the feature at risk, and we should still have plenty of time to get the inconsistency reviewed and fixed via the appropriate process. I would not be in favor of making a decision here while everyone is on holiday, especially since this will likely require input from more people and some more context on the shape/limitations on the other APIs we've exposed.

@fiigii
Copy link
Author

fiigii commented Dec 20, 2018

Rebased to try CI testing again.

@fiigii
Copy link
Author

fiigii commented Dec 20, 2018

@dotnet-bot test Windows_NT x64 Checked Innerloop Build and Test
@dotnet-bot test Windows_NT x64 Checked CoreFX Tests

@tannergooding
Copy link
Member

@dotnet/dnceng, are there any known issues with the Arm/OSX machines right now? Seems the jobs keep losing connecting mid run.

@fiigii
Copy link
Author

fiigii commented Dec 20, 2018

And two Windows_NT x64 tests triggered but not started (too long queue?).

@Chrisboh
Copy link
Member

@tannergooding the power spikes have turned off all physical machines. There is an ongoing outage that is currently being worked on. Updates have been sent to the partners alias and we will continue to update that email thread when we have more.

@fiigii
Copy link
Author

fiigii commented Dec 20, 2018

Most jobs get green, @tannergooding can we ignore the CI stuck and the unrelated failure in "Windows_NT x64 Checked CoreFX Tests"?

@tannergooding
Copy link
Member

test Ubuntu arm Cross Checked crossgen_comparison Build and Test
test Ubuntu16.04 arm64 Cross Checked Innerloop Build and Test
test Ubuntu16.04 arm64 Cross Checked no_tiered_compilation_innerloop Build and Test

@tannergooding
Copy link
Member

can we ignore the CI stuck and the unrelated failure in "Windows_NT x64 Checked CoreFX Tests"?

I would like to see the Ubuntu arm jobs passing, as they've started working on the other PRs now.

The Windows_NT x64 Checked CoreFX Tests we can probably ignore, as it looks unrelated and is also failing on other PRs. I would like to see the OSX leg also pass, as it is our only OSX leg, but someone else might be able to comment if they feel it is fine to ignore.

@fiigii
Copy link
Author

fiigii commented Dec 21, 2018

Thanks!

@CarolEidt CarolEidt merged commit 7ac4a46 into dotnet:master Dec 21, 2018
@fiigii fiigii deleted the bmi branch December 21, 2018 22:11
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corefx that referenced this pull request Dec 22, 2018
* Add tests for BMI1/2 intrinsic

* Implement the remaining BMI1/2 intrinsic

* Fix emitDispIns for BMI instruction

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/corert that referenced this pull request Dec 22, 2018
* Add tests for BMI1/2 intrinsic

* Implement the remaining BMI1/2 intrinsic

* Fix emitDispIns for BMI instruction

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
Dotnet-GitSync-Bot pushed a commit to Dotnet-GitSync-Bot/mono that referenced this pull request Dec 22, 2018
* Add tests for BMI1/2 intrinsic

* Implement the remaining BMI1/2 intrinsic

* Fix emitDispIns for BMI instruction

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
jkotas pushed a commit to dotnet/corefx that referenced this pull request Dec 22, 2018
* Add tests for BMI1/2 intrinsic

* Implement the remaining BMI1/2 intrinsic

* Fix emitDispIns for BMI instruction

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
jkotas pushed a commit to dotnet/corert that referenced this pull request Dec 22, 2018
* Add tests for BMI1/2 intrinsic

* Implement the remaining BMI1/2 intrinsic

* Fix emitDispIns for BMI instruction

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
marek-safar pushed a commit to mono/mono that referenced this pull request Dec 22, 2018
* Add tests for BMI1/2 intrinsic

* Implement the remaining BMI1/2 intrinsic

* Fix emitDispIns for BMI instruction

Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>
picenka21 pushed a commit to picenka21/runtime that referenced this pull request Feb 18, 2022
* Add tests for BMI1/2 intrinsic

* Implement the remaining BMI1/2 intrinsic

* Fix emitDispIns for BMI instruction


Commit migrated from dotnet/coreclr@7ac4a46
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

API Proposal: Add Intel hardware intrinsic functions and namespace Implement BMI1 and BMI2 intrinsics
7 participants