-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow StoreLclVar src to be IND/FLD. #59315
Allow StoreLclVar src to be IND/FLD. #59315
Conversation
// It usually happens when the real type is a small struct. | ||
loadIns = INS_mov; | ||
} | ||
emit->emitInsLoadInd(loadIns, emitTypeSize(tree), tree->GetRegNum(), tree); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic allows us to generate STORE_LCL_VAR struct<1> V01(IND struct(addr)
as
mov al, [addr]
mov [V11], al
instead of
movsx rax, [addr]
mov [V11], al
The solution with additional flag does not look very nice but it was better than alternatives that I have tried:
- support contained IND under
STORE_LCL_VAR
: it required too many changes in LSRA/codegen/emitter and it was hard to make it work for all cases and it created many new checks for this scenario; - type the
IND
asint
so we domov rax, [addr]
instead ofmov al, [addr]
. This could be a solution without any additional diffs but I am not sure if this transformation is correct.
We currently use it for nullchecks already since https://github.com/dotnet/runtime/pull/37991/files soNULLCHECK byte addr
is currently transformed intoNULLCHECK int addr
but if theaddr
is the last byte on a last readable page won't it cause issues? Maybe we should reconsider this transformation (inruntime/src/coreclr/jit/lower.cpp
Line 6736 in 78593b9
ind->gtType = TYP_INT;
could somebody please confirm if it can't cause issues?
PTAL @kunalspathak @BruceForstall. Could you please also trigger JitStress/JitStressRegs and tag jitcontrib? |
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsThis change allows You can see the changes with -total number of locals: 346736, number of enregistered: 309780, notEnreg: 36956, ratio: 0.89
-total number of struct locals: 10920, number of enregistered: 86, notEnreg: 10834, ratio: 0.01
+total number of locals: 346736, number of enregistered: 309849, notEnreg: 36887, ratio: 0.89
+total number of struct locals: 10920, number of enregistered: 155, notEnreg: 10765, ratio: 0.01
-m_blockOp 449, ratio: 0.01
+m_blockOp 379, ratio: 0.01
+m_lclAddrNode 1, ratio: 0.00 there are some good diffs:
There are regressions due to SIMD register upper save/restore. Imagine that we write and read a local once, but between these 2 points we have N calls, it will cause 10 UpperVectoreStore/Restore to be generated. This is an LSRA issue, it should be able to see that SIMD lclVar is rarely used but has to be saved/restored often so it is not profitable to allocate it to a register. runtime/src/coreclr/jit/lsra.cpp Lines 4985 to 5031 in cef245d
https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lsra.cpp#L5351-L5359 The worst is:
The regressions are not in benchmarks/asp so I think we can create an issue about it but take the change as is and plan the fix for 7.0. Full diffs
Diff example-; Total bytes of code 35, prolog size 7, PerfScore 16.65, instruction count 9
+; Total bytes of code 15, prolog size 3, PerfScore 9.95, instruction count 5
-G_M9563_IG01: ; func=00, offs=000000H, size=0007H
+G_M9563_IG01: ; func=00, offs=000000H, size=0003H
-sub rsp, 24
-vzeroupper
+vzeroupper
-G_M9563_IG02: ; offs=000007H, size=0017H
+G_M9563_IG02: ; offs=000003H, size=000BH
-vmovupd xmm0, xmmword ptr [rcx]
-vmovupd xmmword ptr [V03 rsp+08H], xmm0
-vmovupd xmm0, xmmword ptr [V03 rsp+08H]
-vmovupd xmmword ptr [rdx], xmm0
-mov rax, rdx
+vmovupd xmm0, xmmword ptr [rcx]
+vmovupd xmmword ptr [rdx], xmm0
+mov rax, rdx
-G_M9563_IG03: ; offs=00001EH, size=0005H
+G_M9563_IG03: ; offs=00000EH, size=0001H
-add rsp, 24
ret
|
@dotnet/jit-contrib |
Related: #58874 However do we ever see null-check nodes on addresses that aren't pointer aligned? |
/azp run runtime-coreclr jitstress, runtime-coreclr jitstressregs |
Azure Pipelines successfully started running 2 pipeline(s). |
I don't see why we can't do something like:
if roslyn does not optimize it JIT will see: I have tried it and it seems working:
looks like we need to stop changing type in |
The failures look unrelated, @kunalspathak could I ask you to open an issue about upper save/restore cause you can add labels and know the area better? |
jitstressregs/jitstress failures related to #58481 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While the changes seems good to me, the arm64 regression is not small to ignore.
Running asm diffs of D:\Sergey\git\runtime\artifacts\spmi\mch\5ed35c58-857b-48dd-a818-7c0136dc9f73.Linux.arm64\libraries.crossgen2.Linux.arm64.checked.mch
Total bytes of delta: 26804 (1.06% of base)
3112 total files with Code Size differences (731 improved, 2381 regressed), 970 unchanged.
Running asm diffs of D:\Sergey\git\runtime\artifacts\spmi\mch\5ed35c58-857b-48dd-a818-7c0136dc9f73.Linux.arm64\libraries.pmi.Linux.arm64.checked.mch
Total bytes of delta: 36544 (3.81% of base)
2464 total files with Code Size differences (85 improved, 2379 regressed), 103 unchanged.
Running asm diffs of D:\Sergey\git\runtime\artifacts\spmi\mch\5ed35c58-857b-48dd-a818-7c0136dc9f73.windows.arm64\libraries.crossgen2.windows.arm64.checked.mch
Total bytes of delta: 26372 (1.02% of base)
3173 total files with Code Size differences (774 improved, 2399 regressed), 978 unchanged.
Running asm diffs of D:\Sergey\git\runtime\artifacts\spmi\mch\5ed35c58-857b-48dd-a818-7c0136dc9f73.windows.arm64\libraries.pmi.windows.arm64.checked.mch
Total bytes of delta: 36396 (3.65% of base)
2508 total files with Code Size differences (109 improved, 2399 regressed), 112 unchanged.
Do you mind sharing the perf score numbers along couple of diff examples? If the arm64 regressions are related to SIMD save/restore, the sensible thing to do would be to fix it first before taking this change.
Here are the full diffs including PerfScore and an example of JitDump diff, search for "Active intervals at end of allocation:" in the diff to see the new refPositions |
Ping to myself. |
Ping @kunalspathak. |
Hoping to look at it this week. |
@sandreenko - do you mind rebasing your changes to get the fresh test result? We also have asmdiff pipeline now which will help us see the diffs across various collections/platforms. |
Sure, unfortunately, I have just moved to NC and my desktop is still on its way, once I get it and assemble it I will update the diff. |
532d2f6
to
a653110
Compare
Done. It looks like SuperPmi has unrelated failures right now so I could not repeat the testing. |
@sandreenko - I want to look into your PR and understand the regressions. Unfortunately, I am busy with #64130 with various errors. Once I wrap that up, I will look into this PR. Thanks! |
Ping to myself. |
I am back now, I am planning to change this PR to x64/arm32/x86 so there are no regressions. We will enable arm64 when storeuppervector issue is fixed. |
a653110
to
6897d87
Compare
@kunalspathak the diff was updated to exclude arm64. New diffs:
the regressions are for functions with many funclet epilogues when we restore callee-save xmm in each of them, also could be improved in LSRA. Full diffs: |
@kunalspathak could you please review the new diff? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Few questions and it should be ready to go. I did check the diffs (specially those on x64) and most of them are coming because we need more registers and we would save the existing values on stack inside prolog before using them, so that should be fine.
E.g. Windows/x64 asp
38 total files with Code Size differences (14 improved, 24 regressed), 445 unchanged.
Top method regressions (bytes):
74 (1.32 % of base) : 65671.dasm - <ExecuteReaderAsync>d__19:MoveNext():this
74 (1.32 % of base) : 52754.dasm - <ExecuteReaderAsync>d__19:MoveNext():this
56 (2.45 % of base) : 66005.dasm - <DisposeAsync>d__17:MoveNext():this
56 (2.45 % of base) : 52980.dasm - <DisposeAsync>d__17:MoveNext():this
36 (1.24 % of base) : 44251.dasm - <DisposeAsync>d__17:MoveNext():this
36 (1.24 % of base) : 57988.dasm - <DisposeAsync>d__17:MoveNext():this
26 (0.94 % of base) : 63133.dasm - <DisposeAsync>d__17:MoveNext():this
25 (1.90 % of base) : 65969.dasm - RelationalCommandDiagnosticsLogger:CommandReaderExecutedAsync(IRelationalConnection,DbCommand,DbContext,Guid,Guid,DbDataReader,DateTimeOffset,TimeSpan,int,CancellationToken):ValueTask`1:this
25 (1.90 % of base) : 53137.dasm - RelationalCommandDiagnosticsLogger:CommandReaderExecutedAsync(IRelationalConnection,DbCommand,DbContext,Guid,Guid,DbDataReader,DateTimeOffset,TimeSpan,int,CancellationToken):ValueTask`1:this
17 (0.24 % of base) : 43711.dasm - <ExecuteReaderAsync>d__19:MoveNext():this
17 (0.24 % of base) : 57590.dasm - <ExecuteReaderAsync>d__19:MoveNext():this
15 (1.28 % of base) : 65917.dasm - RelationalCommandDiagnosticsLogger:CommandReaderExecutingAsync(IRelationalConnection,DbCommand,DbContext,Guid,Guid,DateTimeOffset,int,CancellationToken):ValueTask`1:this
15 (1.28 % of base) : 53131.dasm - RelationalCommandDiagnosticsLogger:CommandReaderExecutingAsync(IRelationalConnection,DbCommand,DbContext,Guid,Guid,DateTimeOffset,int,CancellationToken):ValueTask`1:this
14 (0.21 % of base) : 62916.dasm - <ExecuteReaderAsync>d__19:MoveNext():this
12 (2.07 % of base) : 29218.dasm - KeyRing:.ctor(IKey,IEnumerable`1):this
5 (0.12 % of base) : 60963.dasm - RBTree`1:RBInsert(int,int,int,int,bool):int:this
3 (0.35 % of base) : 65731.dasm - RelationalCommandDiagnosticsLogger:CommandCreated(IRelationalConnection,DbCommand,int,DbContext,Guid,Guid,DateTimeOffset,TimeSpan,int):DbCommand:this
3 (0.35 % of base) : 53129.dasm - RelationalCommandDiagnosticsLogger:CommandCreated(IRelationalConnection,DbCommand,int,DbContext,Guid,Guid,DateTimeOffset,TimeSpan,int):DbCommand:this
1 (0.88 % of base) : 60985.dasm - RBTree`1:Minimum(int):int:this
1 (0.76 % of base) : 60996.dasm - RBTree`1:Successor(int):int:this
// Do it now. | ||
GenTreeIndir* indir = src->AsIndir(); | ||
LowerIndir(indir); | ||
#if defined(TARGET_XARCH) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is defined(TARGET_XARCH)
needed and rest of the code inside !defined(TARGET_ARM64)
includes arm as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have GTF_IND_DONT_EXTEND
only for x64 because there mov rax, [addr]
is smaller than INS_movzx/INS_movsx al, [addr]
. On other platforms it is not defined and not needed.
{ | ||
if (src->OperIs(GT_OBJ, GT_BLK)) | ||
{ | ||
src->SetOper(GT_IND); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unless there is an assert somewhere that makes sure that currOper != newOper
, just remove the if (src->OperIs(GT_OBJ, GT_BLK))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is RecordOperBashing
calls inside and I remember that I used to have breakpoints there, I would prefer not to call it if it is not neccessary.
ba27770
to
414f3d0
Compare
ping @kunalspathak |
/azp run runtime-coreclr superpmi-diffs, runtime-coreclr superpmi-replay |
Azure Pipelines successfully started running 2 pipeline(s). |
Sorry, I forgot about this. I am rerunning superpmi jobs because they timed out in last run. |
LGTM. Thank you! |
Hi, @jakobbotsch /home/qiao/work_qiao/push_runtime/src/coreclr/jit/lower.cpp:7259:10: error: no member named 'ClearDontExtend' in 'GenTreeIndir' ind->ClearDontExtend(); ~~~ ^ 1 error generated. |
@shushanhf I assume the ifdef chain there should be -#ifdef TARGET_ARM64
+#if defined(TARGET_ARM64) || defined(TARGET_LOONGARCH64)
bool useNullCheck = true;
#elif TARGET_ARM
bool useNullCheck = false;
#else // TARGET_XARCH
bool useNullCheck = !ind->Addr()->isContained();
ind->ClearDontExtend();
#endif // !TARGET_XARCH Feel free to submit this change with an upcoming PR. |
Thanks. |
Improvements on windows/x64: dotnet/perf-autofiling-issues#5496, dotnet/perf-autofiling-issues#5463 |
This change allows
STORE_LCL_VAR(IND/LCL_FLD)
to stay as is in lowering without transforming them intoSTORE_OBJ(ADDR(LCL_VAR), IND/LCL_FLD)
and allows the local to be enregistered.You can see the changes with
COMPLUS_JitEnregStats=1
, for example, for benchmarks.run.windows.x64.checked.mch:there are some good diffs:
There are regressions due to SIMD register upper save/restore. Imagine that we write and read a local once, but between these 2 points we have N calls, it will cause 10 UpperVectoreStore/Restore to be generated. This is an LSRA issue, it should be able to see that SIMD lclVar is rarely used but has to be saved/restored often so it is not profitable to allocate it to a register.
I have discussed this with @kunalspathak and he showed me that there are a few places already where we make such decisions:
runtime/src/coreclr/jit/lsra.cpp
Lines 4985 to 5031 in cef245d
runtime/src/coreclr/jit/lsra.cpp
Lines 5351 to 5359 in cef245d
The worst is:
The regressions are not in benchmarks/asp so I think we can create an issue about it but take the change as is and plan the fix for 7.0.
Full diffs
Diff example