-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JIT: Added SVE APIs CreateMaskForFirstActiveElement
and CreateMaskForNextActiveElement
#104002
JIT: Added SVE APIs CreateMaskForFirstActiveElement
and CreateMaskForNextActiveElement
#104002
Conversation
Note regarding the
|
1 similar comment
Note regarding the
|
Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics |
@dotnet/arm64-contrib @kunalspathak this is ready. Current test failures are only in UnsafeRead and I believe it is due to the predicate callee-save register bug. Example assembly of one of the failures: G_M52469_IG41: ;; offset=0x00C0
ptrue p0.h
;; size=4 bbWeight=0.25 PerfScore 0.50
G_M52469_IG42: ;; offset=0x00C4
cmpne p0.h, p0/z, z16.h, #0
;; size=4 bbWeight=0.25 PerfScore 0.75
G_M52469_IG43: ;; offset=0x00C8
mov x0, x22
;; size=4 bbWeight=0.25 PerfScore 0.12
G_M52469_IG44: ;; offset=0x00CC
movz x1, #0x7DB0 // code for JIT.HardwareIntrinsics.Arm._Sve.SimpleBinaryOpTest__Sve_CreateMaskForFirstActiveElement_ushort+DataTable:get_inArray2Ptr():ulong:this
;; size=4 bbWeight=0.25 PerfScore 0.12
G_M52469_IG45: ;; offset=0x00D0
movk x1, #0xE8D0 LSL #16
;; size=4 bbWeight=0.25 PerfScore 0.12
G_M52469_IG46: ;; offset=0x00D4
movk x1, #0x7FFC LSL #32
;; size=4 bbWeight=0.25 PerfScore 0.12
G_M52469_IG47: ;; offset=0x00D8
ldr x1, [x1]
;; size=4 bbWeight=0.25 PerfScore 0.75
G_M52469_IG48: ;; offset=0x00DC
blr x1
;; size=4 bbWeight=0.25 PerfScore 0.25
G_M52469_IG49: ;; offset=0x00E0
ldr q16, [x0]
;; size=4 bbWeight=0.25 PerfScore 0.75
G_M52469_IG50: ;; offset=0x00E4
ptrue p1.h
;; size=4 bbWeight=0.25 PerfScore 0.50
G_M52469_IG51: ;; offset=0x00E8
cmpne p1.h, p1/z, z16.h, #0
;; size=4 bbWeight=0.25 PerfScore 0.75
G_M52469_IG52: ;; offset=0x00EC
pfirst p1.b, p0, p1.b
;; size=4 bbWeight=0.25 PerfScore 0.50 |
I don't believe the general callee save/trash support is hooked up at the moment and it's instead something @kunalspathak is working on enabling. |
break; | ||
} | ||
|
||
case NI_Sve_CreateMaskForNextActiveElement: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this one can you remove HW_Flag_SpecialCodeGen
and then everything here will be done by common code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It uses INS_sve_mov
. I could expand emitIns_Mov
to handle it though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to update the existing isRMW
code to handle these intrinsics? It takes care of mov
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if we cannot do it, the body of each case is idential and should be shared.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can, but would need to update emitIns_Mov to be aware of predicate registers and use INS_sve_mov for them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without SpecialCodeGen
flag, where in the current code do we go, where we call emitIns_Mov()
?
@@ -47,6 +47,8 @@ HARDWARE_INTRINSIC(Sve, CreateFalseMaskSingle, | |||
HARDWARE_INTRINSIC(Sve, CreateFalseMaskUInt16, -1, 0, false, {INS_invalid, INS_invalid, INS_invalid, INS_sve_pfalse, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_ReturnsPerElementMask) | |||
HARDWARE_INTRINSIC(Sve, CreateFalseMaskUInt32, -1, 0, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_pfalse, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_ReturnsPerElementMask) | |||
HARDWARE_INTRINSIC(Sve, CreateFalseMaskUInt64, -1, 0, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_pfalse, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_ReturnsPerElementMask) | |||
HARDWARE_INTRINSIC(Sve, CreateMaskForFirstActiveElement, -1, 2, true, {INS_sve_pfirst, INS_sve_pfirst, INS_sve_pfirst, INS_sve_pfirst, INS_sve_pfirst, INS_sve_pfirst, INS_sve_pfirst, INS_sve_pfirst, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_ExplicitMaskedOperation|HW_Flag_LowMaskedOperation|HW_Flag_ReturnsPerElementMask|HW_Flag_LowMaskedOperation|HW_Flag_SpecialCodeGen|HW_Flag_HasRMWSemantics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't need HW_Flag_LowMaskedOperation
for pfirst
.
@@ -47,6 +47,8 @@ HARDWARE_INTRINSIC(Sve, CreateFalseMaskSingle, | |||
HARDWARE_INTRINSIC(Sve, CreateFalseMaskUInt16, -1, 0, false, {INS_invalid, INS_invalid, INS_invalid, INS_sve_pfalse, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_ReturnsPerElementMask) | |||
HARDWARE_INTRINSIC(Sve, CreateFalseMaskUInt32, -1, 0, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_pfalse, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_ReturnsPerElementMask) | |||
HARDWARE_INTRINSIC(Sve, CreateFalseMaskUInt64, -1, 0, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_pfalse, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_ReturnsPerElementMask) | |||
HARDWARE_INTRINSIC(Sve, CreateMaskForFirstActiveElement, -1, 2, true, {INS_sve_pfirst, INS_sve_pfirst, INS_sve_pfirst, INS_sve_pfirst, INS_sve_pfirst, INS_sve_pfirst, INS_sve_pfirst, INS_sve_pfirst, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_ExplicitMaskedOperation|HW_Flag_LowMaskedOperation|HW_Flag_ReturnsPerElementMask|HW_Flag_LowMaskedOperation|HW_Flag_SpecialCodeGen|HW_Flag_HasRMWSemantics) | |||
HARDWARE_INTRINSIC(Sve, CreateMaskForNextActiveElement, -1, 2, true, {INS_invalid, INS_sve_pnext, INS_invalid, INS_sve_pnext, INS_invalid, INS_sve_pnext, INS_invalid, INS_sve_pnext, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_ExplicitMaskedOperation|HW_Flag_LowMaskedOperation|HW_Flag_ReturnsPerElementMask|HW_Flag_LowMaskedOperation|HW_Flag_SpecialCodeGen|HW_Flag_HasRMWSemantics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here.
break; | ||
} | ||
|
||
case NI_Sve_CreateMaskForNextActiveElement: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if we cannot do it, the body of each case is idential and should be shared.
break; | ||
} | ||
|
||
case NI_Sve_CreateMaskForNextActiveElement: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
without SpecialCodeGen
flag, where in the current code do we go, where we call emitIns_Mov()
?
@@ -28,6 +28,346 @@ public static Vector<T> InitVector<T>(Func<int, T> f) | |||
return new Vector<T>(arr); | |||
} | |||
|
|||
public static byte[] CreateMaskForFirstActiveElement(byte[] mask, byte[] srcMask) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this can't be simple as:
public static byte[] CreateMaskForFirstActiveElement(byte[] mask, byte[] srcMask)
{
for (var i = 0; i < count; i++)
{
// For active lane, set the element
if (mask[i] != 0)
{
srcMask[i] = 1;
break;
}
}
return srcMask;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, wondering if we can templatize it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might be able to; but would need to do some reflection to dynamically handle zero and one for the primitives.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll get templatized when this all gets copied back into helpers.tt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am ok to not do templatize version if it is too complicated. We already have lot of helpers that do not use template version.
return -1; | ||
} | ||
|
||
public static byte[] CreateMaskForNextActiveElement(byte[] mask, byte[] srcMask) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be easier to undersand:
public static ushort[] CreateMaskForNextActiveElement(ushort[] mask, ushort[] srcMask)
{
int activeElement = 0;
int i;
// Find the active element in srcMask
for (i = 0; i < count; i++)
{
if (srcMask[i] == 0)
{
activeElement = i + 1;
break;
}
}
// Find the next active element in mask
var result = new byte[Vector<ushort>.Count];
for (i = activeElement; i < count; i++)
{
if (mask[i] == 1)
{
result[i] = 1;
break;
}
}
return result;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was pretty much verbatim copying the logic from the arm docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might be easier to undersand:
Looks good to me (ditto for the simpler CreateMaskForFirstActiveElement
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The suggestion for CreateMaskForFirstActiveElement
mutates the srcMask
which we do not want to do. I'll cleanup these methods though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The suggestion for CreateMaskForFirstActiveElement mutates the srcMask which we do not want to do
does it matter for our testing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We explicitly do the same for xarch intrinsics for the same reason, a lot of the logic for these intrinsics is quite complex and we don't want to introduce risk by trying to do any cleanup as compared to what is actually documented to occur (as that can lead to subtle bugs that are difficult to track down).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does. I expect these helpers to not mutate any inputs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The arm architecture manuals describe a behavior using pseudo-code already and so matching that as closely as possible in the test validation is general goodness and far more beneficial in practice than trying to make rewrite the validation APIs to some form of more aesthetically pleasing code.
That should be the intent, but simplifying the code really helps in understanding if we are even validating the right thing. It took me a while to understand the logic here. Plus, when we say "copying the logic", we are implementing it ourselves anyway, and that can have bugs as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but simplifying the code really helps in understanding if we are even validating the right thing
There's not really a need to understand what it's doing, the algorithm matches what is spec'd and that is sufficient for knowing its testing the right thing.
we are implementing it ourselves anyway, and that can have bugs as well.
Yes, and it is easier to validate that it is correct and matches the expected behavior when it is 1-to-1 (or as close to possible) with the architecture manual.
If we have to simplify the code, then every person needs to read the Architecture manual and work through the logic of simplifying it and understanding what the code is meant to be doing.
If we instead simply mirror the algorithm as closely as possible, it becomes far more trivial to do a comparison with the architecture manual and see that it effectively identical and therefore can be presumed correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correctness is by far the most important thing here, far more than perf or any other consideration.
If there was a significant need to document a particularly complex algorithm to make it easier for reviewers to understand, that is then a separate consideration and one that can be handled instead with an attached comment (in code) that describes the operation in "simpler" terms, which then gives us something that can be trivially compared with the source of truth while still providing additional context that makes it easier to understand what's happening.
It should already forward. I have an idea though how to make this work without modifying emitIns_Mov |
@dotnet/arm64-contrib @kunalspathak this is ready again. |
@@ -824,7 +824,25 @@ void CodeGen::genHWIntrinsic(GenTreeHWIntrinsic* node) | |||
assert(!node->IsEmbMaskOp()); | |||
if (HWIntrinsicInfo::IsExplicitMaskedOperation(intrin.id)) | |||
{ | |||
GetEmitter()->emitIns_R_R_R(ins, emitSize, targetReg, op1Reg, op2Reg, opt); | |||
if (isRMW) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you confirm in what cases we come at this code path vs. the one at the end of this file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CreateMaskForNextActiveElement
will come there, and any table-driven HW intrinsic that has two parameters with an explicit masked operation. But CreateMaskForNextActiveElement
is the first RMW to reach this path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why cannot NI_Sve_CreateMaskForFirstActiveElement
handled here itself after you remove the SpecialCodegen
flag for it. They both need INS_OPTS_SCALABLE_B
as opt. So if this code can work for
NI_Sve_CreateMaskForNextActiveElement
, wondering why can't it work for NI_Sve_CreateMaskForFirstActiveElement
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FirstActiveElement will as well, sorry I didn't include it in my comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I see now. Yes, we shouldn't need the SpecialCodeGen for FirstActiveElement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I actually tried it out. So 'pnext' is PNEXT <Pdn>.<T>, <Pv>, <Pdn>.<T>
while 'pfirst' is PFIRST <Pdn>.B, <Pg>, <Pdn>.B
. So, yea, we actually need to do SpecialCodeGen for FirstActiveElement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hhm, not sure I am still following. Since FirstActiveElement
code that you have is:
assert(isRMW);
assert(HWIntrinsicInfo::IsExplicitMaskedOperation(intrin.id));
if (targetReg != op2Reg)
{
assert(targetReg != op1Reg);
GetEmitter()->emitIns_Mov(INS_sve_mov, emitTypeSize(node), targetReg, op2Reg, /* canSkip */ true);
}
GetEmitter()->emitIns_R_R(ins, emitSize, targetReg, op1Reg, INS_OPTS_SCALABLE_B);
break;
If we delete that, it will come on line 846, check it is isRMW
and go inside if
block and call emitIns_R_R()
. Unless I am missing something major here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From the line you are talking about, when the opt
gets passed to emitIns_R_R
, the opt
is not guaranteed to be INS_OPTS_SCALABLE_B
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, that's where the confusion was on my part. I was looking at the unit tests and thought they both take just INS_OPTS_SCALABLE_B
, but looking again at https://docsmirror.github.io/A64/2023-06/pnext_p_p_p.html, it can get other values as well. Makes sense now. Thanks!
runtime/src/coreclr/jit/codegenarm64test.cpp
Lines 5047 to 5048 in 7f7bb7b
theEmitter->emitIns_R_R(INS_sve_pnext, EA_SCALABLE, REG_P0, REG_P15, | |
INS_OPTS_SCALABLE_B); // PNEXT <Pdn>.<T>, <Pv>, <Pdn>.<T> |
runtime/src/coreclr/jit/codegenarm64test.cpp
Lines 5034 to 5035 in 7f7bb7b
theEmitter->emitIns_R_R(INS_sve_pfirst, EA_SCALABLE, REG_P0, REG_P15, | |
INS_OPTS_SCALABLE_B); // PFIRST <Pdn>.B, <Pg>, <Pdn>.B |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was looking at the unit tests and thought they both take just INS_OPTS_SCALABLE_B
I did the same thing. :)
@@ -47,6 +47,8 @@ HARDWARE_INTRINSIC(Sve, CreateFalseMaskSingle, | |||
HARDWARE_INTRINSIC(Sve, CreateFalseMaskUInt16, -1, 0, false, {INS_invalid, INS_invalid, INS_invalid, INS_sve_pfalse, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_ReturnsPerElementMask) | |||
HARDWARE_INTRINSIC(Sve, CreateFalseMaskUInt32, -1, 0, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_pfalse, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_ReturnsPerElementMask) | |||
HARDWARE_INTRINSIC(Sve, CreateFalseMaskUInt64, -1, 0, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_sve_pfalse, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_ReturnsPerElementMask) | |||
HARDWARE_INTRINSIC(Sve, CreateMaskForFirstActiveElement, -1, 2, true, {INS_sve_pfirst, INS_sve_pfirst, INS_sve_pfirst, INS_sve_pfirst, INS_sve_pfirst, INS_sve_pfirst, INS_sve_pfirst, INS_sve_pfirst, INS_invalid, INS_invalid}, HW_Category_SIMD, HW_Flag_Scalable|HW_Flag_ExplicitMaskedOperation|HW_Flag_ReturnsPerElementMask|HW_Flag_SpecialCodeGen|HW_Flag_HasRMWSemantics) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need SpecialCodeGen
for this. They should be handled the same way CreateMaskForNextActiveElement
is handled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to in order to pass INS_OPTS_SCALABLE_B
as part of emitting the instruction.
@kunalspathak this is ready. Now using the ins_mov helper. |
@@ -824,7 +824,25 @@ void CodeGen::genHWIntrinsic(GenTreeHWIntrinsic* node) | |||
assert(!node->IsEmbMaskOp()); | |||
if (HWIntrinsicInfo::IsExplicitMaskedOperation(intrin.id)) | |||
{ | |||
GetEmitter()->emitIns_R_R_R(ins, emitSize, targetReg, op1Reg, op2Reg, opt); | |||
if (isRMW) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why cannot NI_Sve_CreateMaskForFirstActiveElement
handled here itself after you remove the SpecialCodegen
flag for it. They both need INS_OPTS_SCALABLE_B
as opt. So if this code can work for
NI_Sve_CreateMaskForNextActiveElement
, wondering why can't it work for NI_Sve_CreateMaskForFirstActiveElement
?
@kunalspathak changed the Ins_cpy. So I guess this is ready again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Contributes to #99957
Adds:
CreateMaskForFirstActiveElement
CreateMaskForNextActiveElement