Implement the remaining BMI1/2 intrinsic #21480

fiigii · 2018-12-11T00:34:32Z

This PR implements the remaining BMI1/2 intrinsics.

After this PR, BMI1 and BMI2 are fully implemented and all the intrinsic APIs proposed in https://github.com/dotnet/corefx/issues/22940 are enabled for .NET Core 3.0.

Close https://github.com/dotnet/coreclr/issues/18712 and close https://github.com/dotnet/corefx/issues/22940

@tannergooding @CarolEidt

fiigii · 2018-12-11T00:39:16Z

src/System.Private.CoreLib/shared/System/Runtime/Intrinsics/X86/Bmi1.cs

@@ -35,7 +35,7 @@ public abstract class X64
            ///   BEXTR r64a, reg/m64, r64b
            /// This intrinisc is only available on 64-bit processes
            /// </summary>
-            public static ulong BitFieldExtract(ulong value, byte start, byte length) => BitFieldExtract(value, start, length);
+            public static ulong BitFieldExtract(ulong value, byte start, byte length) => BitFieldExtract(value, (ushort)(start | (length << 8)));


Implement the 3-arg version in managed code, which significantly simplified containment codegen.
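As a rough illustration of what the managed 3-arg overload computes, the packing expression `(ushort)(start | (length << 8))` can be modeled in C++. This is only a sketch: `PackBextrControl` and `BitFieldExtractModel` are hypothetical names, and the software model assumes the documented BEXTR behavior (start in bits 7:0 of the control word, length in bits 15:8, out-of-range fields yielding zero bits). It is not the JIT's or CoreLib's implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: pack start (bits 7:0) and length (bits 15:8) into the
// 16-bit control word, mirroring (ushort)(start | (length << 8)) from the diff.
std::uint16_t PackBextrControl(std::uint8_t start, std::uint8_t length)
{
    return static_cast<std::uint16_t>(start | (length << 8));
}

// Hypothetical software model of the 2-arg (value, control) form for 64-bit operands.
std::uint64_t BitFieldExtractModel(std::uint64_t value, std::uint16_t control)
{
    const unsigned start  = control & 0xFFu;
    const unsigned length = (control >> 8) & 0xFFu;

    if (start >= 64 || length == 0)
    {
        return 0; // nothing left to extract
    }

    const std::uint64_t shifted = value >> start;
    if (length >= 64)
    {
        return shifted; // the field covers every remaining bit
    }
    return shifted & ((std::uint64_t{1} << length) - 1);
}

// The 3-arg form simply forwards to the 2-arg form after packing.
std::uint64_t BitFieldExtractModel(std::uint64_t value, std::uint8_t start, std::uint8_t length)
{
    return BitFieldExtractModel(value, PackBextrControl(start, length));
}
```

With this shape, only the 2-arg form needs to be handled as an intrinsic; the 3-arg form is ordinary code that computes the control word first.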

How does codegen look for this intrinsic? Such a combination of small integer types might result in such messy codegen that the intrinsic would be practically useless for non-constant start and length.

Yes, there is some inefficient codegen from the small integer types:

IN0001: 000009 mov rcx, 0x1F272F13200
IN0002: 000013 mov rcx, gword ptr [rcx]
IN0003: 000016 call TestFramework:BeginScenario(ref)
IN0004: 00001B mov rcx, 0x7FF905B8EFB8
IN0005: 000025 call CORINFO_HELP_NEWSFAST
IN0006: 00002A mov rdi, rax
IN0007: 00002D mov rcx, rdi
IN0008: 000030 call ScalarTernOpTest__BitFieldExtractUInt64:.ctor():this
IN0009: 000035 mov rdx, qword ptr [rdi+8]
IN000a: 000039 mov r8, rdx
IN000b: 00003C movzx r9, byte ptr [rdi+16]
IN000c: 000041 mov ecx, r9d
IN000d: 000044 movzx rax, byte ptr [rdi+17]
IN000e: 000048 mov r10d, eax
IN000f: 00004B shl r10d, 8
IN0010: 00004F or ecx, r10d
IN0011: 000052 movzx rcx, cx
IN0012: 000055 vbextr r8, rcx, r8
IN0013: 00005A mov qword ptr [V03+0x20 rsp+20H], r8
IN0014: 00005F mov r8, 0x1F272F13200
IN0015: 000069 mov r8, gword ptr [r8]
IN0016: 00006C mov gword ptr [V03+0x28 rsp+28H], r8
IN0017: 000071 mov r8d, r9d
IN0018: 000074 mov r9d, eax
IN0019: 000077 mov rcx, rsi
IN001a: 00007A call ScalarTernOpTest__BitFieldExtractUInt64:ValidateResult(long,ubyte,ubyte,long,ref):this
IN001b: 00007F nop

Hmm, it doesn't look too bad. I'm not sure what's up with IN000a and IN000c but they look like artifacts of a rather fancy test setup, you won't get those if you just use local variables. The only real problem seems to be the useless IN0011, that's a consequence of the unfortunately typed control parameter. I think a properly implemented version of JIT's optNarrowTree should be able to get rid of that extra cast.

The only real problem seems to be the useless IN0011,

Right, I think so. The problem is that JIT does not know the following consumer is a truncation and should eliminate the promotion. This issue is not specific to this intrinsic; we have seen it with other ones (e.g., Sse41.Insert).

Seems related to https://github.com/dotnet/coreclr/issues/13210

Note, JIT prints the disasm incorrectly; it actually should be:

0: 41 c1 e2 08       shl r10d,0x8
4: 41 0b ca          or ecx,r10d
7: 0f b7 c9          movzx ecx,cx
a: c4 42 f0 f7 c0    bextr r8,r8,rcx

Note, JIT prints the disasm incorrectly

Could you call this one out on https://github.com/dotnet/coreclr/issues/21441?

I will fix the new instruction print in this PR, and I logged the movzx issue in #21441.

I will fix the new instruction print in this PR,

Done.

The problem is that JIT does not know the following consumer is a truncation and should eliminate the promotion

I wouldn't expect the JIT to start understanding such intrinsic details. Instead, it should recognize that the expression produces a value where the upper 16 bits are already 0. Though of course, it would have been easier if this problem wasn't introduced in the first place by using the wrong parameter type.
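The claim that the upper bits are already zero can be checked exhaustively for the byte-typed inputs. The sketch below models the `shl`/`or` pair (IN000f, IN0010) and the `movzx` (IN0011) from the listing in plain C++; the function name is hypothetical, and it only demonstrates the arithmetic fact, not how the JIT would perform such an analysis.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when keeping only the low 16 bits of the combined value changes
// nothing for every possible (start, length) pair of byte-sized inputs.
bool ZeroExtendIsRedundantForAllByteInputs()
{
    for (std::uint32_t start = 0; start <= 0xFFu; ++start)
    {
        for (std::uint32_t length = 0; length <= 0xFFu; ++length)
        {
            const std::uint32_t combined = start | (length << 8); // what shl + or compute
            const std::uint32_t extended = combined & 0xFFFFu;    // what movzx from cx keeps
            if (combined != extended)
            {
                return false;
            }
        }
    }
    return true;
}
```

Because both inputs were zero-extended from bytes, the combined value never exceeds 0xFFFF, so the extra zero-extension cannot change the result.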

Seems related to #13210

Sort of, but that's more specific and the fix I have for that does not involve optNarrowTree.

src/jit/hwintrinsiccodegenxarch.cpp

fiigii · 2018-12-11T01:20:07Z

@tannergooding Do you know where the CoreFX update PR is? I didn't find it...

tannergooding · 2018-12-11T01:27:00Z

Do you know where the CoreFX update PR is? I didn't find it...

#21374 looks to be the last PR that updated CoreFX, which looks to be from the same day we merged the API changes. The next one should contain the fixups, but it looks like the build might be blocked.

fiigii · 2018-12-14T18:55:00Z

@dotnet-bot test this please

fiigii · 2018-12-14T20:54:07Z

@dotnet-bot test Ubuntu x64 Checked CoreFX Tests
@dotnet-bot test Windows_NT x86 Release Innerloop Build and Test

fiigii · 2018-12-14T20:55:35Z

This PR is ready for final review. @CarolEidt @tannergooding @mikedn PTAL

mikedn · 2018-12-14T22:19:54Z

You really should fix the type of BitFieldExtract's start/length/control parameters. There's no reason for those to be byte/ushort.

CarolEidt

LGTM overall, with one suggested assert and some comments.

src/jit/emitxarch.cpp

src/jit/gentree.cpp

src/jit/hwintrinsiccodegenxarch.cpp

src/jit/hwintrinsiclistxarch.h

src/jit/hwintrinsiccodegenxarch.cpp

src/jit/lowerxarch.cpp

fiigii · 2018-12-15T01:04:27Z

@CarolEidt Thank you for the comments. Addressed feedback.

fiigii · 2018-12-17T21:41:05Z

@tannergooding ping?

src/jit/gentree.cpp

CarolEidt

LGTM with one more assert request and one remaining typo to fix.

src/jit/hwintrinsiccodegenxarch.cpp

tannergooding · 2018-12-18T17:13:54Z

src/jit/codegen.h

@@ -931,6 +931,8 @@ XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
    void genHWIntrinsic_R_RM(GenTreeHWIntrinsic* node, instruction ins, emitAttr attr);
    void genHWIntrinsic_R_RM_I(GenTreeHWIntrinsic* node, instruction ins, int8_t ival);
    void genHWIntrinsic_R_R_RM(GenTreeHWIntrinsic* node, instruction ins, emitAttr attr);


Might it be worthwhile to just get rid of this one and have the callsites updated to call the new overload directly?

The original overload is called from several places, so removing the wrapper overload would duplicate some code. I will refactor some HW intrinsic code later; that may be a good time to reconsider this.

src/jit/emitxarch.cpp

src/jit/hwintrinsiccodegenxarch.cpp

tannergooding · 2018-12-18T17:19:41Z

src/jit/lowerxarch.cpp

@@ -2835,7 +2835,9 @@ void Lowering::ContainCheckHWIntrinsic(GenTreeHWIntrinsic* node)
                    {
                        MakeSrcContained(node, op2);
                    }
-                    else if (isCommutative && IsContainableHWIntrinsicOp(node, op1, &supportsRegOptional))
+                    else if ((isCommutative || (intrinsicId == NI_BMI2_MultiplyNoFlags) ||


Why can't we just mark MultiplyNoFlags as commutative?

MultiplyNoFlags has both 2-arg and 3-arg overloads. Adding the commutative flag for MultiplyNoFlags will break some existing assumptions (e.g., if (numArgs == 3) assert(!isCommutative);). I think that is not worthwhile for this very special one.

tannergooding · 2018-12-18T17:24:13Z

src/jit/lowerxarch.cpp

+                                    MakeSrcContained(node, op1);
+                                    // MultiplyNoFlags is a Commutative operation, so swap the first two operands here
+                                    // to make the containment checks in codegen significantly simpler
+                                    *(originalArgList->pCurrent())         = op2;


I think this is similar to the FMA case and we could just mark either op1 or op2 as contained without switching here. We already have the logic in codegen to extract the appropriate registers. Also doing the swap for op1/op2 depending on which is contained should also be trivial to support there.

We are reusing the existing code for the 2-arg overload, which may swap op1 and op2 in lowering. So also swapping op1/op2 here for the 3-arg version unifies the code that follows.
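The swap discussed here is only valid because the two multiplicands are interchangeable. A small C++ model of the 32-bit 3-argument form can illustrate that; it assumes the API returns the high half of the product and writes the low half through a pointer, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the 32-bit 3-argument form: returns the high half of
// the 64-bit product and stores the low half through the pointer.
std::uint32_t MultiplyNoFlagsModel(std::uint32_t left, std::uint32_t right, std::uint32_t* low)
{
    const std::uint64_t product = static_cast<std::uint64_t>(left) * right;
    *low = static_cast<std::uint32_t>(product);
    return static_cast<std::uint32_t>(product >> 32);
}

// Swapping the first two operands must leave both halves unchanged, which is
// why either one can be chosen as the contained (memory) operand.
bool SwappedOperandsAgree(std::uint32_t a, std::uint32_t b)
{
    std::uint32_t lowAB = 0;
    std::uint32_t lowBA = 0;
    const std::uint32_t highAB = MultiplyNoFlagsModel(a, b, &lowAB);
    const std::uint32_t highBA = MultiplyNoFlagsModel(b, a, &lowBA);
    return highAB == highBA && lowAB == lowBA;
}
```

The third argument (the destination of the low half) is not interchangeable with the other two, which fits the point above that a blanket commutative flag does not apply cleanly to the 3-arg overload.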

fiigii · 2018-12-20T02:13:09Z

@dotnet-bot test this

fiigii · 2018-12-20T03:09:25Z

@tannergooding Does this PR look good to you? CI seems stuck again, but it was all green...

mikedn · 2018-12-20T03:18:05Z

Again, you should fix bextr's control parameter, it should not be ushort, it should be int.

fiigii · 2018-12-20T03:30:05Z

@mikedn Sorry for missing your comment. As I said above, the codegen issue is not specific to bextr.
For hardware intrinsics, we have decided to provide more "intuitive" APIs than C++; for example, we provide Sse41.Insert(Vector128<sbyte> value, sbyte data, byte index) instead of Sse41.Insert(Vector128<sbyte> value, int data, byte index). So we keep the bextr API consistent with that as well. And I think we can address the movzx problem with optimizations (e.g., the optNarrowTree that you suggested) for all the small integer APIs, not only bextr.

tannergooding · 2018-12-20T03:37:04Z

Again, you should fix bextr's control parameter, it should not be ushort, it should be int.

This is not what was proposed/reviewed/approved by the framework API review team. If we want it changed and there is a strong/sufficient reason to have it changed, we need a separate API proposal requesting it. It will then need to be fast-tracked through the review process after everyone gets back from holiday.

However, I'm not sure that the API review team will opt for changing the API this late (cc. @terrajobst). It was already fairly extensively discussed whether we would expose int like the native APIs, or if we should expose byte (which better matches what the instructions actually support) and it was previously decided that we should try to lead users toward the pit of success by typing the immediate parameters appropriately (that is, if they only support inputs from 0-255, or less, we should only allow them to give inputs of 0-255).

Having byte/sbyte/short/ushort parameters is also not completely unusual and I would expect that it would be more broadly beneficial to ensure these cases generate good code.

tannergooding · 2018-12-20T06:38:20Z

But if you do it and if you aren't capable of doing it properly then at least have some common sense and do the reasonable thing

The "reasonable" thing is not always obvious. It is generally the case that people have different views on the "correct" approach to designing an API or solving a problem. And sometimes that view is skewed by a person's experience or inexperience in a particular area. Not everyone is an expert on the runtime or on low-level code in general, but each person brings something unique and useful to the table, including our external contributors.

Even if it was, the same "excuse" would apply to BZHI as well. Guess what, that has an uint parameter.

Yes, this looks to be a case that we missed and we should aim for consistency. We might not have found it if it weren't for people looking at the surface area and pointing it out. The API surface area for the HWIntrinsics isn't exactly small and it has undergone a few revisions as people have used the previews and provided feedback.

I saw a problem and commented on it. Feel free to take it to the API review team.

I will try and log an issue around the concern you've raised here. However, I am not you and may not be able to 100% capture your viewpoint or correctly relay the comments you are making. I can only capture what I've interpreted your concern to be, and link back to the conversation here.

mikedn · 2018-12-20T06:57:49Z

Not everyone is an expert on the runtime or on low-level code in general

Again, if you do not know what you're doing then the best thing to do is not to wade into uncharted territories by inventing things that do not need to be invented. But no, let's deviate from typical .NET API conventions (most use int unless there's a good reason to use other types), let's deviate from the native intrinsic design (because hey, we probably know better than those guys), let's deviate from the actual hw instruction (because staying as close as possible to the hw instruction wasn't a design mantra to begin with).

fiigii · 2018-12-20T07:25:47Z

the same "excuse" would apply to BZHI as well. Guess what, that has an uint parameter.

@mikedn Thanks for pointing this out.

I will try and log an issue around the concern you've raised here.

I went over all the scalar ISA APIs; almost all of them operate on ulong/uint (except CRC32, which indeed needs small integers). Perhaps we can keep them consistent by using only ulong/uint, rather than going the opposite way?
And we can treat scalar APIs separately from the SIMD ones, which, for example, still let Insert take scalar parameters of the vector's base type.
@tannergooding @danmosemsft If we just change this one API, BitFieldExtract, for API consistency, can we skip the formal API review? We already did similar work just after the 2.1 release.

tannergooding · 2018-12-20T07:35:28Z

If we just change this one API, BitFieldExtract, for API consistency, can we skip the formal API review? We already did similar work just after the 2.1 release.

We are much later in the release cycle now and we did follow up with all of the previous changes to get them affirmed. This change would be no different.

I don't think we will move anywhere actionable on this until after the holidays are over.

It would be beneficial, in the meantime, if we could get an issue opened tracking this. It might also be beneficial to list the scalar intrinsics, what they currently take as input arguments, and what the actual instruction takes (including limitations on the input range). -- I don't believe we have that many that operate directly on scalar values, so the list hopefully won't be too large

fiigii · 2018-12-20T08:01:31Z

We are much later in the release cycle now and we did follow up with all of the previous changes to get them affirmed. This change would be no different.

I don't think we will move anywhere actionable on this until after the holidays are over.

Hmm, it is unfortunate that we are this late in the release cycle. Okay, in my opinion, let's keep the current design, then discuss this topic in the next API review if possible. Although the current BMI APIs have a bit of inconsistency, other solutions also have disadvantages (e.g., making scalar APIs inconsistent with SIMD, or making some APIs less intuitive). This is a really small issue (its drawback can be solved by optimizations), so it is not worth blocking the whole feature release. Make sense?

tannergooding · 2018-12-20T16:21:31Z

Okay, in my opinion, let's keep the current design, then discuss this topic in the next API review if possible

I agree. Any future API modifications shouldn't block this PR from going through.

Although the current BMI APIs have a bit of inconsistency, other solutions also have disadvantages (e.g., making scalar APIs inconsistent with SIMD, or making some APIs less intuitive). This is a really small issue (its drawback can be solved by optimizations), so it is not worth blocking the whole feature release. Make sense?

This shouldn't put the feature at risk, and we should still have plenty of time to get the inconsistency reviewed and fixed via the appropriate process. I would not be in favor of making a decision here while everyone is on holiday, especially since this will likely require input from more people and some more context on the shape/limitations on the other APIs we've exposed.

fiigii · 2018-12-20T19:06:47Z

Rebased to try CI testing again.

fiigii · 2018-12-20T22:12:23Z

@dotnet-bot test Windows_NT x64 Checked Innerloop Build and Test
@dotnet-bot test Windows_NT x64 Checked CoreFX Tests

tannergooding · 2018-12-20T22:14:08Z

@dotnet/dnceng, are there any known issues with the Arm/OSX machines right now? It seems the jobs keep losing connection mid-run.

fiigii · 2018-12-20T22:17:24Z

And two Windows_NT x64 tests were triggered but have not started (queue too long?).

Chrisboh · 2018-12-20T22:44:40Z

@tannergooding the power spikes have turned off all physical machines. There is an ongoing outage that is currently being worked on. Updates have been sent to the partners alias and we will continue to update that email thread when we have more.

fiigii · 2018-12-20T23:50:26Z

Most jobs are green. @tannergooding, can we ignore the stuck CI and the unrelated failure in "Windows_NT x64 Checked CoreFX Tests"?

tannergooding · 2018-12-21T00:02:46Z

test Ubuntu arm Cross Checked crossgen_comparison Build and Test
test Ubuntu16.04 arm64 Cross Checked Innerloop Build and Test
test Ubuntu16.04 arm64 Cross Checked no_tiered_compilation_innerloop Build and Test

tannergooding · 2018-12-21T00:04:37Z

can we ignore the stuck CI and the unrelated failure in "Windows_NT x64 Checked CoreFX Tests"?

I would like to see the Ubuntu arm jobs passing, as they've started working on the other PRs now.

The Windows_NT x64 Checked CoreFX Tests we can probably ignore, as it looks unrelated and is also failing on other PRs. I would like to see the OSX leg also pass, as it is our only OSX leg, but someone else might be able to comment if they feel it is fine to ignore.

fiigii · 2018-12-21T00:05:38Z

Thanks!

* Add tests for BMI1/2 intrinsic * Implement the remaining BMI1/2 intrinsic * Fix emitDispIns for BMI instruction Signed-off-by: dotnet-bot <dotnet-bot@microsoft.com>

* Add tests for BMI1/2 intrinsic * Implement the remaining BMI1/2 intrinsic * Fix emitDispIns for BMI instruction Commit migrated from dotnet/coreclr@7ac4a46

tannergooding reviewed Dec 11, 2018

fiigii force-pushed the bmi branch from 592e409 to fd0cfe4 Compare December 11, 2018 01:15

fiigii force-pushed the bmi branch from fd0cfe4 to ab3fa55 Compare December 11, 2018 21:50

fiigii changed the title ~~[WIP] Implement the remaining BMI1/2 intrinsic~~ Implement the remaining BMI1/2 intrinsic Dec 14, 2018

CarolEidt reviewed Dec 14, 2018

fiigii force-pushed the bmi branch from 476ae99 to 18a96de Compare December 14, 2018 23:21

sixlettervariables reviewed Dec 18, 2018

CarolEidt reviewed Dec 18, 2018

tannergooding reviewed Dec 18, 2018

tannergooding reviewed Dec 18, 2018

tannergooding reviewed Dec 18, 2018

fiigii force-pushed the bmi branch from 18a96de to 9aeb22b Compare December 18, 2018 19:18

fiigii closed this Dec 19, 2018

fiigii reopened this Dec 19, 2018

FeiPengIntel added 3 commits December 20, 2018 11:02

Add tests for BMI1/2 intrinsic

77e9257

Implement the remaining BMI1/2 intrinsic

ef02789

Fix emitDispIns for BMI instruction

b8b99e1

fiigii force-pushed the bmi branch from 9aeb22b to b8b99e1 Compare December 20, 2018 19:05

CarolEidt approved these changes Dec 20, 2018

CarolEidt merged commit 7ac4a46 into dotnet:master Dec 21, 2018

fiigii deleted the bmi branch December 21, 2018 22:11

pentp mentioned this pull request Jan 31, 2020

Bmi2.MultiplyNoFlags issues dotnet/runtime#11782

Open
