-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[arm64] JIT: Enable CSE/hoisting for "arrayBase + elementOffset" #61293
Conversation
Tagging subscribers to this area: @JulieLeeMSFT Issue DetailsUsually, for a typical array access we create int Test(int[] array, int i, int j)
=> array[i] + array[j];
int Sum(int[] array)
{
int sum = 0;
foreach (int item in array)
sum += item;
return sum;
} codegen diff: ; Method Tests:Test(System.Int32[],int,int):int:this
G_M15096_IG01:
stp fp, lr, [sp,#-16]!
mov fp, sp
G_M15096_IG02:
ldr w0, [x1,#8]
cmp w2, w0
bhs G_M15096_IG04
+ add x1, x1, #16
ubfiz x2, x2, #2, #32
- add x2, x2, #16
ldr w2, [x1, x2]
cmp w3, w0
bhs G_M15096_IG04
ubfiz x0, x3, #2, #32
- add x0, x0, #16
ldr w0, [x1, x0]
add w0, w2, w0
G_M15096_IG03:
ldp fp, lr, [sp],#16
ret lr
G_M15096_IG04:
bl CORINFO_HELP_RNGCHKFAIL
bkpt
-; Total bytes of code: 72
+; Total bytes of code: 68
; Method Tests:Sum(System.Int32[]):int:this
G_M56165_IG01:
stp fp, lr, [sp,#-16]!
mov fp, sp
G_M56165_IG02:
mov w0, wzr
mov w2, wzr
ldr w3, [x1,#8]
cmp w3, #0
ble G_M56165_IG04
+ add x1, x1, #16
G_M56165_IG03:
ubfiz x4, x2, #2, #32
- add x4, x4, #16
ldr w4, [x1, x4]
add w0, w0, w4
add w2, w2, #1
cmp w3, w2
bgt G_M56165_IG03
G_M56165_IG04:
ldp fp, lr, [sp],#16
ret lr
; Total bytes of code: 64 In Diffs: coreclr_tests.pmi.Linux.arm64.checked.mch:
Detail diffs
libraries.crossgen2.Linux.arm64.checked.mch:
Detail diffs
libraries.pmi.Linux.arm64.checked.mch:
Detail diffs
libraries_tests.pmi.Linux.arm64.checked.mch:
Detail diffs
^ diff: https://www.diffchecker.com/oUPb1br3 A couple of size-regression examples: improvements:
|
Baseline:
This PR:
41.4% faster than the baseline |
/azp run Fuzzlyn, runtime-coreclr outerloop, runtime-coreclr gcstress0x3-gcstress0xc |
6ab1e92
to
fbc7ffe
Compare
/azp run Fuzzlyn, runtime-coreclr outerloop |
@dotnet/jit-contrib PTAL |
Azure Pipelines successfully started running 2 pipeline(s). |
@EgorBo - It might be good to post PerfScore diff as well, will be good to understand if there is lot of regression in it. Also, do you mind posting the diffs for benchmark collection? |
From the diffs, it looks like most of them are also coming from loop alignment. Might be worth turning loop alignment off for baseline/test to see the real code size impact. |
How do I ask superpmi to generate perfscore btw?
it looks like it's not available for altjit mode I am running spmi under (I wasn't able to run spmi on my apple-m1) |
FWIW, I usually just |
It is only available for windows arm64.
You can use windows collection by passing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like very good results.
A few minor questions.
You should probably trigger jitstress, and also gcstress0xc to validate our assumption that GC will be happy with this new form of byref |
Regression example: private static int[] _goto;
private static int Test(int destpos)
{
_goto[destpos] = AddBacktrackNote();
return _goto[destpos];
} Here index is invariant and E.g. we can pick the shape based on heuristics:
A possible solution is a sort of tree re-association phase during CSE/LoopHoisting/VN e.g. |
/azp list |
/azp run runtime-coreclr jitstress, runtime-coreclr gcstress0x3-gcstress0xc |
Azure Pipelines successfully started running 2 pipeline(s). |
@BruceForstall does it look good now? gcstress test failures are timeouts in Vector64_ro/Vector128_ro/Vector256_ro tests which are huge, I guess we might want to disable them for gcstress.
I added the Benchmarks collection to the list (-20kb diff). PerfScore diffs can be found here https://gist.github.com/EgorBo/7ed76fb5bb5a38cc56bae20306b8e080 Thoughts on reducing regressions in follow up PRs are here: #61293 (comment) |
Thanks! There are quite some benchmarks that regressed. Do they fall under same category that you listed an example of?
|
@kunalspathak I changed the code slighty, what I've done is basically converted ifdef-else into normal if-else and enabled that new shape Most of the benchmarks you listed got fixed, BenchI.AddArray is still a size regression but overall it looks like a perf improvement actually: https://www.diffchecker.com/Msv0RBMs (less code in loops) New diffs: https://gist.github.com/EgorBo/dd3b682f69ce2a530a8019c0e8bf46b2 |
arm64 improvements dotnet/perf-autofiling-issues#2248 |
More Arm64 improvements dotnet/perf-autofiling-issues#2419 |
More improvements: dotnet/perf-autofiling-issues#2406 |
Arm64 regression: dotnet/perf-autofiling-issues#2401 |
Closes #35618, #34810
Usually, for a typical array access we create the following tree
which is basically
invariant1 + (X + invariant2)
, if we change it to:I assume it's safe from GC's point of view, both ADDs are byref and within the objects
we'll be able to hoist/CSE the invariant part
arrRef + elemOffset
on arm/arm64. We don't really need it on XArch because it will be handled by a more powerful addressing mode such asmov r9d, dword ptr [rdx+4*r9+16]
.So, essentially, this PR changes morphing of GT_INDEX like this:
UPD Also, this PR folds LEA(BFIZ) into load/store with scale and sign/zero extension.
Example:
codegen diff:
Benchmark: "Sum all elements in an array":
Benchmark results on Apple M1 arm64:
^ tested with and without loop-alignment
Codegen diff:
Diffs are big but in both - and + directions:
benchmarks.run.windows.arm64.checked.mch:
Detail diffs
coreclr_tests.pmi.windows.arm64.checked.mch:
Detail diffs
libraries.crossgen2.windows.arm64.checked.mch:
Detail diffs
libraries.pmi.windows.arm64.checked.mch:
Detail diffs
libraries_tests.pmi.windows.arm64.checked.mch:
Detail diffs
some diffs we weren't able to fold cloned loops because of it, e.g. for:
^ diff: https://www.diffchecker.com/oUPb1br3
We clone the loop in both my PR and Main, but in Main we then are able to drop it, but not it "PR" due to that CSE of ADD. Both loops don't have bound checks. Btw, number of
bl CORINFO_HELP_RNGCHKFAIL
is the same in base and diff diffs so I assume this PR doesn't regress bound check elimination.A couple of size-regression examples:
https://www.diffchecker.com/VkGHhEin
https://www.diffchecker.com/nOuaRBkO
https://www.diffchecker.com/LDIAWDHY
improvements:
https://www.diffchecker.com/zxoJy9vQ