-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARM64 intrinsic support for Vector64.Create() and Vector128.Create() #35590
ARM64 intrinsic support for Vector64.Create() and Vector128.Create() #35590
Conversation
@dotnet/jit-contrib , @tannergooding |
result = Arm.AdvSimd.Insert(result, 5, e5); | ||
result = Arm.AdvSimd.Insert(result, 6, e6); | ||
result = Arm.AdvSimd.Insert(result, 7, e7); | ||
result = Arm.AdvSimd.Insert(result, 8, e8); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not for this PR, but since we only have 8 argument registers the rest will be passed on the stack.
In which case would be easier to load the remaining 64 bits into a register directly from the stack into the top part of the 128-bit vector.
i.e. do something like this:
and w0, w0, 255
fmov s0, w0
ins v0.b[1], w1
ins v0.b[2], w2
ins v0.b[3], w3
ins v0.b[4], w4
ins v0.b[5], w5
ins v0.b[6], w6
ins v0.b[7], w7
ld1 {v0.d}[1], [sp]
And you can do this is something like this for a lot of the cases with more than 8 arguments. Do you know what the code generates here at the moment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point which I didn't realize. I will explore your suggestion.
Generated code
53001C00 uxtb w0, w0
4E011C10 ins v16.b[0], w0
53001C20 uxtb w0, w1
4E031C10 ins v16.b[1], w0
53001C40 uxtb w0, w2
4E051C10 ins v16.b[2], w0
53001C60 uxtb w0, w3
4E071C10 ins v16.b[3], w0
53001C80 uxtb w0, w4
4E091C10 ins v16.b[4], w0
53001CA0 uxtb w0, w5
4E0B1C10 ins v16.b[5], w0
53001CC0 uxtb w0, w6
4E0D1C10 ins v16.b[6], w0
53001CE0 uxtb w0, w7
4E0F1C10 ins v16.b[7], w0
B94023A0 ldr w0, [fp,#32] // [V08 arg8]
53001C00 uxtb w0, w0
4E111C10 ins v16.b[8], w0
B9402BA0 ldr w0, [fp,#40] // [V09 arg9]
53001C00 uxtb w0, w0
4E131C10 ins v16.b[9], w0
B94033A0 ldr w0, [fp,#48] // [V10 arg10]
53001C00 uxtb w0, w0
4E151C10 ins v16.b[10], w0
B9403BA0 ldr w0, [fp,#56] // [V11 arg11]
53001C00 uxtb w0, w0
4E171C10 ins v16.b[11], w0
B94043A0 ldr w0, [fp,#64] // [V12 arg12]
53001C00 uxtb w0, w0
4E191C10 ins v16.b[12], w0
B9404BA0 ldr w0, [fp,#72] // [V13 arg13]
53001C00 uxtb w0, w0
4E1B1C10 ins v16.b[13], w0
B94053A0 ldr w0, [fp,#80] // [V14 arg14]
53001C00 uxtb w0, w0
4E1D1C10 ins v16.b[14], w0
B9405BA0 ldr w0, [fp,#88] // [V15 arg15]
53001C00 uxtb w0, w0
4EB01E08 mov v8.16b, v16.16b
4E1F1C08 ins v8.b[15], w0
D28D0500 movz x0, #0x6828
F2A6EF60 movk x0, #0x377b LSL #16
F2CFFFA0 movk x0, #0x7ffd LSL #32
6E084509 mov v9.d[0], v8.d[1]
97FF53EA bl CORINFO_HELP_NEWSFAST
6E180528 mov v8.d[1], v9.d[0]
3C808008 str q8, [x0,#8]
And you can do this is something like this for a lot of the cases with more than 8 arguments.
Actually, this is the only API that has more than 8 arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder how hard is goint to be to get rid of the uxtb-s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I am not much familiar with this, but would love to get rid of them. Just to call out, they show up for byte
, sbyte
, short
and ushort
parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, I am not much familiar with this, but would love to get rid of them. Just to call out, they show up for byte, sbyte, short and ushort parameters.
Correct, because these types are usually "normalized" (i.e. sign- or zero-extended) to a 32 bit value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, those aren't needed indeed, would be good to get rid of them if possible since they double the number of instructions.
Also
B94053A0 ldr w0, [fp,#80] // [V14 arg14]
53001C00 uxtb w0, w0
4E1D1C10 ins v16.b[14], w0
Without the optimization I talked about above should ideally be
ld1 {v16.b[14]}, [x1]
where x1
gets incremented, or you could use a which prevents from having to move between register files. But also don't know how easy this is to do..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I briefly discussed various options with @echesakovMSFT and concluded that it might not be straight forward. I would like to revisit it after we implement other APIs. Opened #35688 to track it.
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector64.cs
Outdated
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs
Show resolved
Hide resolved
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks Good to me with some questions/suggestions.
src/libraries/System.Private.CoreLib/src/System/Runtime/Intrinsics/Vector128.cs
Show resolved
Hide resolved
@@ -752,6 +763,26 @@ public static unsafe Vector128<byte> Create(byte e0, byte e1, byte e2, byte e3, | |||
return Sse2.UnpackLow(lo64, hi64).AsByte(); // < 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 > | |||
} | |||
|
|||
if (AdvSimd.IsSupported) | |||
{ | |||
Vector128<byte> result = CreateScalarUnsafe(e0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@kunalspathak You might want to add comment here in the same fashion as it's done for Sse2 case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spoke offline and this is not needed.
1a86bc1
to
76cc65e
Compare
@tannergooding and @echesakovMSFT - Just FYI, while working on something else I found 2 issues for which we don't have test coverage today.
I will fix and add test coverage for it before merging. |
Added hardware intrinsic for various overloads of
Vector64.Create()
andVector128.Create()
:AdvSimd.Insert
.dup
/mov
/fmov
instructions.While I was there, I noticed an edge case where we hit assert if trying to emit an immediate
int.MaxValue
. Fixed it as well.Contributes to #33308 and #33496.
Fixes: #35821