Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[JitStress] VectorTableLookupExtension fails in jitstress1 #84696

Closed
tannergooding opened this issue Apr 12, 2023 · 9 comments
Closed

[JitStress] VectorTableLookupExtension fails in jitstress1 #84696

tannergooding opened this issue Apr 12, 2023 · 9 comments
Assignees
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Milestone

Comments

@tannergooding
Copy link
Member

For:

set DOTNET_TieredCompilation=0
set DOTNET_JitStress=1

We see

14:38:16.844 Running test: _AdvSimd_Arm64_ro::JIT.HardwareIntrinsics.Arm._AdvSimd.Arm64.Program.VectorLookupExtension_2_Byte()
Beginning scenario: RunBasicScenario_UnsafeRead
Arm64.VectorTableLookupExtension<Byte>(Vector128<Byte>, (Vector128<Byte>, Vector128<Byte>>), Vector128<Byte>): RunBasicScenario_UnsafeRead failed:
 defaultValues: (31, 10, 93, 216, 169, 113, 148, 77, 204, 243, 166, 115, 27, 230, 43, 78)
 firstOp: (213, 126, 207, 229, 250, 214, 46, 187, 23, 195, 145, 45, 18, 53, 73, 179)
 secondOp: (198, 236, 120, 37, 38, 35, 212, 247, 230, 73, 19, 125, 181, 45, 221, 145)
 indices: (6, 1, 19, 0, 35, 20, 30, 20, 2, 4, 26, 23, 4, 35, 39, 31)
 result: (46, 126, 37, 213, 169, 38, 0, 38, 207, 250, 0, 247, 250, 230, 43, 0)

and

14:40:31.521 Running test: _AdvSimd_ro::JIT.HardwareIntrinsics.Arm._AdvSimd.Program.VectorLookupExtension_3_SByte()
Beginning scenario: RunBasicScenario_UnsafeRead
AdvSimd.VectorTableLookupExtension<SByte>(Vector64<SByte>, (Vector128<SByte>, Vector128<SByte>, Vector128<SByte>), Vector64<SByte>): RunBasicScenario_UnsafeRead failed:
 defaultValues: (126, 61, 97, 86, 65, 115, 19, 72)
 firstOp: (50, 13, 101, 20, 25, 90, 52, 33, 89, 27, 13, 109, 110, 112, 63, 98)
 secondOp: (122, 101, 116, 73, 82, 67, 54, 84, 40, 24, 93, 83, 77, 8, 37, 50)
 thirdOp: (33, 79, 16, 13, 33, 72, 74, 123, 89, 26, 63, 115, 96, 15, 59, 32)
 indices: (44, 36, 7, 43, 4, 12, 56, 11)
 result: (0, 33, 33, 0, 25, 0, 19, 0)
@tannergooding tannergooding added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Apr 12, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Apr 12, 2023
@tannergooding
Copy link
Member Author

CC. @kunalspathak

@ghost
Copy link

ghost commented Apr 12, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

For:

set DOTNET_TieredCompilation=0
set DOTNET_JitStress=1

We see

14:38:16.844 Running test: _AdvSimd_Arm64_ro::JIT.HardwareIntrinsics.Arm._AdvSimd.Arm64.Program.VectorLookupExtension_2_Byte()
Beginning scenario: RunBasicScenario_UnsafeRead
Arm64.VectorTableLookupExtension<Byte>(Vector128<Byte>, (Vector128<Byte>, Vector128<Byte>>), Vector128<Byte>): RunBasicScenario_UnsafeRead failed:
 defaultValues: (31, 10, 93, 216, 169, 113, 148, 77, 204, 243, 166, 115, 27, 230, 43, 78)
 firstOp: (213, 126, 207, 229, 250, 214, 46, 187, 23, 195, 145, 45, 18, 53, 73, 179)
 secondOp: (198, 236, 120, 37, 38, 35, 212, 247, 230, 73, 19, 125, 181, 45, 221, 145)
 indices: (6, 1, 19, 0, 35, 20, 30, 20, 2, 4, 26, 23, 4, 35, 39, 31)
 result: (46, 126, 37, 213, 169, 38, 0, 38, 207, 250, 0, 247, 250, 230, 43, 0)

and

14:40:31.521 Running test: _AdvSimd_ro::JIT.HardwareIntrinsics.Arm._AdvSimd.Program.VectorLookupExtension_3_SByte()
Beginning scenario: RunBasicScenario_UnsafeRead
AdvSimd.VectorTableLookupExtension<SByte>(Vector64<SByte>, (Vector128<SByte>, Vector128<SByte>, Vector128<SByte>), Vector64<SByte>): RunBasicScenario_UnsafeRead failed:
 defaultValues: (126, 61, 97, 86, 65, 115, 19, 72)
 firstOp: (50, 13, 101, 20, 25, 90, 52, 33, 89, 27, 13, 109, 110, 112, 63, 98)
 secondOp: (122, 101, 116, 73, 82, 67, 54, 84, 40, 24, 93, 83, 77, 8, 37, 50)
 thirdOp: (33, 79, 16, 13, 33, 72, 74, 123, 89, 26, 63, 115, 96, 15, 59, 32)
 indices: (44, 36, 7, 43, 4, 12, 56, 11)
 result: (0, 33, 33, 0, 25, 0, 19, 0)
Author: tannergooding
Assignees: -
Labels:

arch-arm64, area-CodeGen-coreclr

Milestone: -

@kunalspathak
Copy link
Member

Thanks, I will take a look. Most likely this is happening because delay reg frees are not getting tracked properly.

@kunalspathak kunalspathak self-assigned this Apr 12, 2023
@JulieLeeMSFT JulieLeeMSFT added this to the 8.0.0 milestone Apr 12, 2023
@JulieLeeMSFT JulieLeeMSFT removed the untriaged New issue has not been triaged by the area owner label Apr 12, 2023
@kunalspathak
Copy link
Member

Looking at the result and inferring, it seems that the input operand's vector register's upper half are zeroed out. With that only indices between 0 ~ 7 and 16 ~ 23 are returning correct result while other inbound indices are returning 0. Out of bound indices are still correct and returns the correct value. Seems something corrupts the input registers during upper vector restore. Will continue investigation.

@kunalspathak
Copy link
Member

Pasting the assembly here. v17 and v18 are the table registers and are copied from v10 and v12 respectively whose upper-half was restored in the past.

Assembly
; Assembly listing for method JIT.HardwareIntrinsics.Arm._AdvSimd.Arm64.VectorLookupExtension_2Test__VectorTableLookupExtensionByte:RunBasicScenario_UnsafeRead():this
; Emitting BLENDED_CODE for generic ARM64 CPU - Windows
; optimized code
; fp based frame
; fully interruptible
; No PGO data
; 0 inlinees with PGO data; 14 single block inlinees; 0 inlinees without PGO data
; Final local variable assignments
;
;  V00 this         [V00,T01] (  5,  5   )     ref  ->  x19         this class-hnd single-def
;  V01 loc0         [V01,T22] (  2,  2   )  simd16  ->   d8         HFA(simd16) 
;  V02 tmp0         [V02,T16] (  1,  1   )     int  ->  [fp+14H]   do-not-enreg[V] "GSCookie dummy"
;# V03 OutArgs      [V03    ] (  1,  1   )  struct ( 0) [sp+00H]   do-not-enreg[XS] addr-exposed "OutgoingArgSpace"
;  V04 tmp2         [V04,T17] (  2,  4   )  simd16  ->   d8         HFA(simd16)  "impAppendStmt"
;* V05 tmp3         [V05,T18] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  ptr "impAppendStmt"
;* V06 tmp4         [V06,T03] (  0,  0   )  struct (32) zero-ref    HFA(simd16)  do-not-enreg[SF] ld-addr-op ptr "NewObj constructor temp"
;* V07 tmp5         [V07,T19] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  ptr "struct address for call/obj"
;* V08 tmp6         [V08    ] (  0,  0   )  struct (32) zero-ref    HFA(simd16)  do-not-enreg[SF] ptr "impAppendStmt"
;* V09 tmp7         [V09    ] (  0,  0   )  simd16  ->  zero-ref    HFA(simd16)  "struct address for call/obj"
;  V10 tmp8         [V10,T07] (  2,  4   )    long  ->  x22         "impAppendStmt"
;* V11 tmp9         [V11    ] (  0,  0   )    long  ->  zero-ref    "impAppendStmt"
;* V12 tmp10        [V12    ] (  0,  0   )    long  ->  zero-ref    "impAppendStmt"
;* V13 tmp11        [V13    ] (  0,  0   )    long  ->  zero-ref    "impAppendStmt"
;* V14 tmp12        [V14    ] (  0,  0   )     ref  ->  zero-ref    class-hnd "Inlining Arg"
;* V15 tmp13        [V15    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V16 tmp14        [V16,T02] (  3,  6   )   byref  ->  x22         single-def "Inlining Arg"
;* V17 tmp15        [V17    ] (  0,  0   )    long  ->  zero-ref    ld-addr-op "Inline stloc first use temp"
;  V18 tmp16        [V18,T08] (  2,  4   )    long  ->   x0         "Inlining Arg"
;  V19 tmp17        [V19,T04] (  3,  6   )    long  ->   x1         "Inlining Arg"
;* V20 tmp18        [V20    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V21 tmp19        [V21,T09] (  2,  4   )    long  ->   x0         "Inlining Arg"
;* V22 tmp20        [V22    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;* V23 tmp21        [V23    ] (  0,  0   )    long  ->  zero-ref    "Inlining Arg"
;  V24 tmp22        [V24,T06] (  2,  4   )   byref  ->  x24         single-def "Inlining Arg"
;* V25 tmp23        [V25    ] (  0,  0   )    long  ->  zero-ref    ld-addr-op "Inline stloc first use temp"
;  V26 tmp24        [V26,T10] (  2,  4   )    long  ->  x23         "Inlining Arg"
;  V27 tmp25        [V27,T05] (  3,  6   )    long  ->  x24         "Inlining Arg"
;  V28 tmp26        [V28,T11] (  2,  4   )    long  ->  x25         "argument with side effect"
;  V29 tmp27        [V29,T12] (  2,  4   )    long  ->  x26         "argument with side effect"
;  V30 tmp28        [V30,T13] (  2,  4   )    long  ->   x5         "argument with side effect"
;  V31 GsCookie     [V31    ] (  1,  1   )    long  ->  [fp+18H]   do-not-enreg[X] addr-exposed "GSSecurityCookie"
;  V32 cse0         [V32,T20] (  2,  2   )  simd16  ->  d10         "CSE - stress mode"
;  V33 cse1         [V33,T00] ( 11, 11   )   byref  ->  x21         "CSE - aggressive"
;  V34 cse2         [V34,T21] (  2,  2   )  simd16  ->  d12         "CSE - moderate"
;  V35 cse3         [V35,T14] (  3,  3   )   byref  ->  x23         "CSE - moderate"
;  V36 cse4         [V36,T15] (  3,  3   )     ref  ->  x20         "CSE - moderate"
;
; Lcl frame size = 16

G_M52190_IG01:
            stp     fp, lr, [sp, #-0x90]!
            stp     d8, d9, [sp, #0x20]
            stp     d10, d11, [sp, #0x30]
            stp     d12, d13, [sp, #0x40]
            stp     x19, x20, [sp, #0x50]
            stp     x21, x22, [sp, #0x60]
            stp     x23, x24, [sp, #0x70]
            stp     x25, x26, [sp, #0x80]
            mov     fp, sp
            movz    x1, #0xD1FFAB1E
            movk    x1, #0xD1FFAB1E LSL #16
            movk    x1, #0xD1FFAB1E LSL #32
            movk    x1, #0xD1FFAB1E LSL #48
            str     x1, [fp, #0x18]
            mov     x19, x0
						;; size=60 bbWeight=1 PerfScore 12.00
G_M52190_IG02:
            movz    x0, #0xD1FFAB1E
            movk    x0, #0xD1FFAB1E LSL #16
            movk    x0, #0xD1FFAB1E LSL #32
            movz    x20, #0xD1FFAB1E
            movk    x20, #0xD1FFAB1E LSL #16
            movk    x20, #0xD1FFAB1E LSL #32
            mov     x1, x20
            movz    x2, #0xD1FFAB1E      // code for System.String:Concat(System.String,System.String):System.String
            movk    x2, #0xD1FFAB1E LSL #16
            movk    x2, #0xD1FFAB1E LSL #32
            ldr     x2, [x2]
            blr     x2
            movz    x1, #0xD1FFAB1E      // code for System.Console:WriteLine(System.String)
            movk    x1, #0xD1FFAB1E LSL #16
            movk    x1, #0xD1FFAB1E LSL #32
            ldr     x1, [x1]
            blr     x1
            ldrsb   wzr, [x19]
            add     x21, x19, #88
            mov     x0, x21
            movz    x1, #0xD1FFAB1E      // code for JIT.HardwareIntrinsics.Arm._AdvSimd.Arm64.VectorLookupExtension_2Test__VectorTableLookupExtensionByte+DataTable:get_inArray0Ptr():ulong:this
            movk    x1, #0xD1FFAB1E LSL #16
            movk    x1, #0xD1FFAB1E LSL #32
            ldr     x1, [x1]
            blr     x1
            ldr     q8, [x0]
            mov     x22, x21
            add     x23, x22, #56
            mov     x0, x23
            movz    x1, #0xD1FFAB1E      // code for System.Runtime.InteropServices.GCHandle:AddrOfPinnedObject():long:this
            movk    x1, #0xD1FFAB1E LSL #16
            movk    x1, #0xD1FFAB1E LSL #32
            ldr     x1, [x1]
            mov     v9.d[0], v8.d[1]
            blr     x1
            ldr     x1, [x22, #0x28]
            add     x0, x0, x1
            sub     x0, x0, #1
            sub     x1, x1, #1
            bic     x0, x0, x1
            ldr     q10, [x0]
            mov     x0, x21
            movz    x1, #0xD1FFAB1E      // code for JIT.HardwareIntrinsics.Arm._AdvSimd.Arm64.VectorLookupExtension_2Test__VectorTableLookupExtensionByte+DataTable:get_inArray2Ptr():ulong:this
            movk    x1, #0xD1FFAB1E LSL #16
            movk    x1, #0xD1FFAB1E LSL #32
            ldr     x1, [x1]
            mov     v11.d[0], v10.d[1]
            blr     x1
            ldr     q12, [x0]
            mov     v10.d[1], v11.d[0]
            mov     v12.d[1], v13.d[0]
            mov     x0, x21
            movz    x1, #0xD1FFAB1E      // code for JIT.HardwareIntrinsics.Arm._AdvSimd.Arm64.VectorLookupExtension_2Test__VectorTableLookupExtensionByte+DataTable:get_inArray3Ptr():ulong:this
            movk    x1, #0xD1FFAB1E LSL #16
            movk    x1, #0xD1FFAB1E LSL #32
            ldr     x1, [x1]
            mov     v13.d[0], v12.d[1]
            blr     x1
            ldr     q16, [x0]
            mov     v8.d[1], v9.d[0]
            mov     v17.16b, v10.16b
            mov     v18.16b, v12.16b
            tbx     v8.16b, {v17.16b, v18.16b}, v16.16b
            mov     x0, x21
            movz    x1, #0xD1FFAB1E      // code for JIT.HardwareIntrinsics.Arm._AdvSimd.Arm64.VectorLookupExtension_2Test__VectorTableLookupExtensionByte+DataTable:get_outArrayPtr():ulong:this
            movk    x1, #0xD1FFAB1E LSL #16
            movk    x1, #0xD1FFAB1E LSL #32
            ldr     x1, [x1]
            mov     v10.d[0], v8.d[1]
            blr     x1
            mov     v8.d[1], v10.d[0]
            str     q8, [x0]
            mov     x0, x21
            movz    x1, #0xD1FFAB1E      // code for JIT.HardwareIntrinsics.Arm._AdvSimd.Arm64.VectorLookupExtension_2Test__VectorTableLookupExtensionByte+DataTable:get_inArray0Ptr():ulong:this
            movk    x1, #0xD1FFAB1E LSL #16
            movk    x1, #0xD1FFAB1E LSL #32
            ldr     x1, [x1]
            blr     x1
            mov     x22, x0
            mov     x24, x21
            mov     x0, x23
            movz    x1, #0xD1FFAB1E      // code for System.Runtime.InteropServices.GCHandle:AddrOfPinnedObject():long:this
            movk    x1, #0xD1FFAB1E LSL #16
            movk    x1, #0xD1FFAB1E LSL #32
            ldr     x1, [x1]
            blr     x1
            mov     x23, x0
            ldr     x24, [x24, #0x28]
            mov     x0, x21
            movz    x1, #0xD1FFAB1E      // code for JIT.HardwareIntrinsics.Arm._AdvSimd.Arm64.VectorLookupExtension_2Test__VectorTableLookupExtensionByte+DataTable:get_inArray2Ptr():ulong:this
            movk    x1, #0xD1FFAB1E LSL #16
            movk    x1, #0xD1FFAB1E LSL #32
            ldr     x1, [x1]
            blr     x1
            mov     x25, x0
            mov     x0, x21
            movz    x1, #0xD1FFAB1E      // code for JIT.HardwareIntrinsics.Arm._AdvSimd.Arm64.VectorLookupExtension_2Test__VectorTableLookupExtensionByte+DataTable:get_inArray3Ptr():ulong:this
            movk    x1, #0xD1FFAB1E LSL #16
            movk    x1, #0xD1FFAB1E LSL #32
            ldr     x1, [x1]
            blr     x1
            mov     x26, x0
            mov     x0, x21
            movz    x1, #0xD1FFAB1E      // code for JIT.HardwareIntrinsics.Arm._AdvSimd.Arm64.VectorLookupExtension_2Test__VectorTableLookupExtensionByte+DataTable:get_outArrayPtr():ulong:this
            movk    x1, #0xD1FFAB1E LSL #16
            movk    x1, #0xD1FFAB1E LSL #32
            ldr     x1, [x1]
            blr     x1
            mov     x5, x0
            mov     x3, x25
            mov     x4, x26
            sub     x2, x24, #1
            add     x0, x24, x23
            sub     x0, x0, #1
            bic     x2, x0, x2
            mov     x0, x19
            mov     x1, x22
            mov     x6, x20
            movz    x7, #0xD1FFAB1E      // code for JIT.HardwareIntrinsics.Arm._AdvSimd.Arm64.VectorLookupExtension_2Test__VectorTableLookupExtensionByte:ValidateResult(ulong,ulong,ulong,ulong,ulong,System.String):this
            movk    x7, #0xD1FFAB1E LSL #16
            movk    x7, #0xD1FFAB1E LSL #32
            ldr     x7, [x7]
            blr     x7
            movz    xip0, #0xD1FFAB1E
            movk    xip0, #0xD1FFAB1E LSL #16
            movk    xip0, #0xD1FFAB1E LSL #32
            movk    xip0, #0xD1FFAB1E LSL #48
            ldr     xip1, [fp, #0x18]
            cmp     xip0, xip1
            beq     G_M52190_IG03
            bl      CORINFO_HELP_FAIL_FAST
						;; size=524 bbWeight=1 PerfScore 130.50
G_M52190_IG03:
            ldp     x25, x26, [sp, #0x80]
            ldp     x23, x24, [sp, #0x70]
            ldp     x21, x22, [sp, #0x60]
            ldp     x19, x20, [sp, #0x50]
            ldp     d12, d13, [sp, #0x40]
            ldp     d10, d11, [sp, #0x30]
            ldp     d8, d9, [sp, #0x20]
            ldp     fp, lr, [sp], #0x90
            ret     lr
						;; size=36 bbWeight=1 PerfScore 9.00

; Total bytes of code 620, prolog size 60, PerfScore 213.50, instruction count 155, allocated bytes for code 620 (MethodHash=bbf93421) for method JIT.HardwareIntrinsics.Arm._AdvSimd.Arm64.VectorLookupExtension_2Test__VectorTableLookupExtensionByte:RunBasicScenario_UnsafeRead():this
; ============================================================

; defaultValues: (190, 124, 118, 226, 138, 42, 111, 167, 253, 210, 20, 0, 49, 124, 130, 225)
; firstOp: (192, 163, 112, 25, 117, 254, 2, 192, 78, 94, 137, 159, 14, 248, 198, 150)
; secondOp: (198, 133, 186, 70, 250, 91, 16, 238, 194, 225, 234, 20, 87, 197, 150, 48)
; indices: (27, 34, 19, 12, 13, 13, 24, 22, 30, 28, 32, 32, 31, 36, 12, 1)
; result: (0, 124, 70, 0, 0, 0, 0, 16, 0, 0, 20, 0, 0, 124, 0, 163)

@kunalspathak
Copy link
Member

kunalspathak commented Apr 14, 2023

This seems to be an odd existing problem that was surfaced with the consecutive work because this is the first time that we started using FieldList() for non-call nodes.

For table operand of VectorTableLookup() we create FIELD_LIST to represent it. The actual rows (or vectors) goes inside the FIELD_LIST . Below, [000476] and [000494] are 2 vectors that build the [000333] field list. The field list is used as operand of intrinsic node [000331]. However, between the definition of field list ([000333]) and actual use of it (``[000331]), there are various nodes, including a CALL node [000337]` .

N003 (  5,  6) [000317] UA---------                         ▌  STORE_LCL_FLD simd16 V06 tmp4         ud:3->4[+16]
N001 (  1,  1) [000332] -----------                  t332 =    LCL_VAR   simd16<System.Runtime.Intrinsics.Vector128`1> V04 tmp2         u:2 (last use) <l:$340, c:$c4>
N002 (  1,  1) [000476] -----------                  t476 =    LCL_VAR   simd16 V32 cse0         u:1 <l:$341, c:$c5>
N003 (  1,  1) [000494] -----------                  t494 =    LCL_VAR   simd16 V34 cse2         u:1 <l:$342, c:$c6>
                                                            ┌──▌  t476   simd16 
                                                            ├──▌  t494   simd16 
N004 (  2,  2) [000333] -c---------                  t333 = ▌  FIELD_LIST struct $143
N005 (  1,  1) [000483] -----------                  t483 =    LCL_VAR   byref  V33 cse1         u:1 $300
                                                            ┌──▌  t483   byref  this in x0
N006 ( 15,  4) [000337] --CXGO-----                  t337 = ▌  CALL      long   JIT.HardwareIntrinsics.Arm._AdvSimd.Arm64.VectorLookupExtension_2Test__VectorTableLookupExtensionByte+DataTable:get_inArray3Ptr():ulong:this $18f
                                                            ┌──▌  t337   long   
N007 ( 21,  8) [000336] ---XGO-----                  t336 = ▌  IND       simd16
                                                            ┌──▌  t332   simd16 
                                                            ├──▌  t333   struct 
                                                            ├──▌  t336   simd16 
N008 ( 25, 12) [000331] ---XGO-----                  t331 = ▌  HWINTRINSIC simd16 ubyte VectorTableLookupExtension $388
                                                            ┌──▌  t331   simd16 
N010 ( 29, 15) [000329] DA-XGO-----                         ▌  STORE_LCL_VAR simd16<System.Runtime.Intrinsics.Vector128`1> V01 loc0         d:2

When we build RefPosition for intrinsic node [000331], we might have to build RefPosition to restore the upper half vectors of those vectors that are part of the field_list. For example, here is the list of refpositions created:

  1. #386 : Restore upper-half vector for the 1st operand of VectorTableLookupExtension
  2. #387 : Actual use of the 1st operand.
  3. #388: Restore upper-half vector for the 1st field of our field list which is the 2nd operand of VectorTableLookupExtension. This basically corresponds to [000476] in the above tree node dump.
  4. #389 : Actual use of the 1st field of our field list.
  5. #390 : Restore upper-half vector for the 2nd field of our field list. This basically corresponds to [000494] in the above tree node dump.
  6. #391 : Actual use of the 2st field of our field list.
  7. #392: Actual use of the 3rd operand.
  8. #393: Result of VectorTableLookupExtension.
N161 ( 25, 12) [000331] ---XGO-----                         ▌  HWINTRINSIC simd16 ubyte VectorTableLookupExtension REG NA $388
<RefPosition #386 @161 RefTypeUpperVectorRestore <Ivl:19 (U04)> LCL_VAR BB01 regmask=[allFloat] minReg=1 wt=400.00>
<RefPosition #387 @161 RefTypeUse <Ivl:2 V04> LCL_VAR BB01 regmask=[allFloat] minReg=1 last wt=400.00>
<RefPosition #388 @161 RefTypeUpperVectorRestore <Ivl:20 (U32)> LCL_VAR BB01 regmask=[allFloat] minReg=1 wt=200.00>
<RefPosition #389 @161 RefTypeUse <Ivl:14 V32> LCL_VAR BB01 regmask=[allFloat] minReg=1 last wt=200.00>
<RefPosition #390 @161 RefTypeUpperVectorRestore <Ivl:21 (U34)> LCL_VAR BB01 regmask=[allFloat] minReg=1 wt=200.00>
<RefPosition #391 @161 RefTypeUse <Ivl:16 V34> LCL_VAR BB01 regmask=[allFloat] minReg=1 last wt=200.00>
<RefPosition #392 @161 RefTypeUse <Ivl:59> BB01 regmask=[allFloat] minReg=1 last wt=100.00>
Interval 60: simd16 RefPositions {} physReg:NA Preferences=[allFloat]
<RefPosition #393 @162 RefTypeDef <Ivl:60> HWINTRINSIC BB01 regmask=[allFloat] minReg=1 wt=400.00>

Note that, Refposition #388 corresponds to the use node [000476] and does not correspond to the actual use of field_list [000331]. As a result, when we have to restore the upper-half vector for this value, we end up restoring it before [000476]. Likewise, it happens for #390, whose upper half vector restore happens at [000494] instead of at [000331].

This leads to multiple problems, some of them can be seen in this example:

  1. Since 1st field of the field_list was the last use, we restored it properly at [000476] , but did not feel the need to save the upper-half again before the CALL at [000337]. Because of the call, the upper half of the 1st field is wiped out.
  2. The save for 2nd field of the field_list happens before the CALL at [000337], but its restore happens at [000494], which happened to be located before the place where the save happened. Thus, we ended up restoring before even saving the upper half of a vector.

Ideally, we need to carry some semantics to tell that for such situations, consider the use as the field_list instead of individual fields use. I will continue investigating if that will help and feasible.

@kunalspathak
Copy link
Member

I have discovered another problem related to the ordering and seems to be pre-existing for upper vector restore and copy registers. The arrangement of refpositions for consecutive registers exposed it. This will be a long post, but it is worth describing the problem in detail.

Typically, if there is a use of a partially-saved vector (vector whose upper-half was saved before the call), we first restore it and then use it. There arises an interesting scenario in presence of FIELD_LIST, where we want to use multiple vector registers in succession, and all should be restored before the use (because all the vector registers were partially-spilled). For example, if there were 3 elements in the FIELD_LIST, the allocation ordering is something like this:

restore upper-half of V07 from r15 into r5
use V07 present in r5
restore upper-half of V08 from r8 into r6
use V08 present in r6
restore upper-half of V11 from r11 into r7
use V11 present in r7

The way we insert the restores is by appending them one after another before the use tree node, in our case, FIELD_LIST or the HWINTRINSIC node and end up something like this:

restore upper-half of V07 from r15 into r5
restore upper-half of V08 from r8 into r6
restore upper-half of V11 from r11 into r7
use V07 present in r5
use V08 present in r6
use V11 present in r7

Having such arrangement can create problems, but they can get worse and hit easily if we had to add copyRegs for each of those variables before use. copyRegs happens if a variable was assigned a register X, but at the current location, we want a different register Y, we need to insert a copy from X to Y to copy the value of variable present in register X into Y before we use the variable value from Y.

For above example, imagine for V07, V08 and V11, they were restored into a different register we need to copy them to the appropriate registers before use, we will generate copyRegs as seen below:

restore upper-half of V07 from r12 into r15
restore upper-half of V08 from r8 into r18
restore upper-half of V11 from r11 into r14
copy V07 from r15 to r5
use V07 present in r5
copy V07 from r18 to r6
use V08 present in r6
copy V07 from r14 to r7
use V11 present in r7

Now let's see an example when a problem can arise because of this. Consider the following order in which registers are allocated for set of refpositions.

[r9]  : Register from where V07 restore happen into home location r8
[r13] : Use V07, but since it was present in r8, generate `copy r8 to r13`
[r11] : Register from where V08 restore happen into home location r10
[r14] : Use V08, but since it was present in r10, generate `copy r10 to r14`
 ; Note - Below, we can assign r8 because it is dead after copyReg for V07
[r8]  : Register from where V11 restore happen into home location r12
[r15] : Use V11, but since it was present in r8, generate `copy r12 to r15`

Below is the pseudo-code of how the code should look like ideally.

restore upper-half of V07 from r9 into r8
copy V07 from r8 to r13
use V07 present in r13  ; consecutive 1
restore upper-half of V08 from r11 into r10
copy V08 from r10 to r14
use V08 present in r14  ; consecutive 2
; load value from stack into r8
restore upper-half of V11 from r8 into r12
copy V11 from r12 to r15
use V11 present in r15  ; consecutive 3

However, since we insert restores one after another, we end up with this:

restore upper-half of V07 from r9 into r8
restore upper-half of V08 from r11 into r10
; load value from stack into r8
restore upper-half of V11 from r8 into r12
copy V07 from r8 to r13   ; <--- PROBLEM. r8 held the value of V11 but we thought we are just copying the value of V08
use V07 present in r13  ; consecutive 1
copy V08 from r10 to r14
use V08 present in r14  ; consecutive 2
copy V11 from r12 to r15
use V11 present in r15  ; consecutive 3

I will check if I can tweak the algorithm to assign the same consecutive register to upper-half positions too, to avoid extra copies. I already added TODO when working on consecutive registers:

// TODO-CQ: We could technically assign RefTypeUpperVectorRestore and its RefTypeUse same register, but
// during register selection, it might get tricky to know which of the busy registers are assigned to
// RefTypeUpperVectorRestore positions of corresponding variables for which (another criteria)
// we are trying to find consecutive registers.

Another possibility is (at least for consecutive register scenarios), mark the registers into which we are restoring the values, and that way they will not used while we are processing the refpositions. In above example, we will basically mark r8 and r10 busy and hence r8 won't be selected to "restore from" register of V11.

Other option is to rearrange the restores properly so they happen right before the actual use instead of arranging them next to each other. This seems a viable option, specially for consecutive registers.

@kunalspathak
Copy link
Member

A snapshot from the Jitdump for one of the example that has problem I described above:

                                                                        ┌──▌  t550   simd16 
Generating:                [001104] ---XG------                 t1104 = ▌  RELOAD    simd16 REG d10
Generating: N397 (  1,  1) [000937] -----------                  t937 =    LCL_VAR   simd16 V74 cse0         u:1 d8 (last use) REG d8 <l:$480, c:$103>
                                                                        ┌──▌  t937   simd16 
Generating:                [001107] -----------                 t1107 = ▌  COPY      simd16 REG d13
Generating: N399 (  1,  1) [000942] -----------                  t942 =    LCL_VAR   simd16 V75 cse1         u:1 d10 (last use) REG d10 <l:$481, c:$104>
                                                                        ┌──▌  t942   simd16 
Generating:                [001110] -----------                 t1110 = ▌  COPY      simd16 REG d14
Generating: N401 (  1,  1) [000947] -----------                  t947 =    LCL_VAR   simd16 V76 cse2         u:1 d12 (last use) REG d12 <l:$482, c:$105>
                                                                        ┌──▌  t947   simd16 
Generating:                [001113] -----------                 t1113 = ▌  COPY      simd16 REG d15
                                                                        ┌──▌  t1107  simd16 
                                                                        ├──▌  t1110  simd16 
                                                                        ├──▌  t1113  simd16 
Generating: N403 (  3,  3) [000047] -c---------                   t47 = ▌  FIELD_LIST struct REG NA $143
Generating: N001 (  3,  2) [001105] -----------                 t1105 =    LCL_VAR   simd16 V74 cse0          d8 REG d8
                                                                        ┌──▌  t1105  simd16 
Generating: N002 (  4,  3) [001106] -----------                 t1106 = ▌  INTRINSIC simd16 simdUpperRestore REG d9
IN0074:                           mov     v8.d[1], v9.d[0]
Generating: N001 (  3,  2) [001108] -----------                 t1108 =    LCL_VAR   simd16 V75 cse1          d10 REG d10
                                                                        ┌──▌  t1108  simd16 
Generating: N002 (  4,  3) [001109] -----------                 t1109 = ▌  INTRINSIC simd16 simdUpperRestore REG d11
IN0075:                           mov     v10.d[1], v11.d[0]
Generating: N001 (  3,  2) [001111] -----------                 t1111 =    LCL_VAR   simd16 V76 cse2          d12 REG d12
                                                                        ┌──▌  t1111  simd16 
Generating: N002 (  4,  3) [001112] ----------z                 t1112 = ▌  INTRINSIC simd16 simdUpperRestore REG d8
IN0076:                           ldr     d8, [fp, #0x28]
IN0077:                           mov     v12.d[1], v8.d[0]
                                                                        ┌──▌  t47    struct 
                                                                        ├──▌  t1104  simd16 
Generating: N405 ( 20, 20) [000051] ---XG------                   t51 = ▌  HWINTRINSIC simd16 ubyte VectorTableLookup REG d10 <l:$4c9, c:$4c8>
							V74 in reg d8 is becoming dead  [000937]
							Live regs: 180000 {x19 x20 d8 d10 d12} => 180000 {x19 x20 d10 d12}
				Live vars after [000937]: {V00 V74 V75 V76 V77} => {V00 V75 V76 V77}
IN0078:                           mov     v13.16b, v8.16b ; <-------------- PROBLEM
							V75 in reg d10 is becoming dead  [000942]
							Live regs: 180000 {x19 x20 d10 d12} => 180000 {x19 x20 d12}
				Live vars after [000942]: {V00 V75 V76 V77} => {V00 V76 V77}
IN0079:                           mov     v14.16b, v10.16b
							V76 in reg d12 is becoming dead  [000947]
							Live regs: 180000 {x19 x20 d12} => 180000 {x19 x20}
				Live vars after [000947]: {V00 V76 V77} => {V00 V77}
IN007a:                           mov     v15.16b, v12.16b
							Tree-Node marked unspilled from [000550]
IN007b:                           ldr     q10, [fp, #0x10]
release temp #1, slot 3, size = 16
IN007c:                           tbl     v10.16b, {v13.16b, v14.16b, v15.16b}, v10.16b

@kunalspathak
Copy link
Member

Fixed by #84824

@ghost ghost locked as resolved and limited conversation to collaborators May 20, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

No branches or pull requests

3 participants