Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arm64: Have CpBlkUnroll and InitBlkUnroll use SIMD registers #68085

Merged
merged 4 commits into from
Apr 22, 2022

Conversation

kunalspathak
Copy link
Member

@kunalspathak kunalspathak commented Apr 15, 2022

Do not restrict SIMD registers only for memory that are 16B aligned.
Experiment to see how many cases we miss out with today's restriction.

Motivation: https://godbolt.org/z/eb53xPvYT
Related discussion: #67326 (comment)

@ghost ghost assigned kunalspathak Apr 15, 2022
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Apr 15, 2022
@ghost
Copy link

ghost commented Apr 15, 2022

Tagging subscribers to this area: @JulieLeeMSFT
See info in area-owners.md if you want to be subscribed.

Issue Details

Do not restrict SIMD registers only for memory that are 16B aligned.
Experiment to see how many cases we miss out with today's restriction.

Motivation: https://godbolt.org/z/eb53xPvYT

Author: kunalspathak
Assignees: kunalspathak
Labels:

area-CodeGen-coreclr

Milestone: -

@@ -2680,7 +2680,7 @@ void CodeGen::genCodeForInitBlkUnroll(GenTreeBlk* node)
// The following condition prevents using 16-byte stores when dstRegAddrAlignment is:
// 1) unknown (i.e. dstReg is neither FP nor SP) or
// 2) non-zero (i.e. dstRegAddr is not 16-byte aligned).
const bool hasAvailableSimdReg = isDstRegAddrAlignmentKnown && (size > FP_REGSIZE_BYTES);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will need to adjust LSRA. Otherwise, it won't allocate SIMD register(s) when src/dst doesn't use sp/fp as base register.

Copy link
Member Author

@kunalspathak kunalspathak Apr 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @EgorChesakov . I believe that might be the reason SPC compilation is failing although, from what I understand, this was just the heuristics and we need not have to adjust LSRA.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to adjust the following logic for InitBlock

if (isDstRegAddrAlignmentKnown && (size > FP_REGSIZE_BYTES))
{
// For larger block sizes CodeGen can choose to use 16-byte SIMD instructions.
buildInternalFloatRegisterDefForNode(blkNode, internalFloatRegCandidates());
}

and for CopyBlock
bool canUse16ByteWideInstrs = isSrcAddrLocal && isDstAddrLocal && (size >= 2 * FP_REGSIZE_BYTES);
// Note that the SIMD registers allocation is speculative - LSRA doesn't know at this point
// whether CodeGen will use SIMD registers (i.e. if such instruction sequence will be more optimal).
// Therefore, it must allocate an additional integer register anyway.
if (canUse16ByteWideInstrs)
{
buildInternalFloatRegisterDefForNode(blkNode, internalFloatRegCandidates());
buildInternalFloatRegisterDefForNode(blkNode, internalFloatRegCandidates());
}

@kunalspathak
Copy link
Member Author

The (code size) diffs look very promising. I am inclined to take this PR and monitor how MicroBenchmark performs. If there are just handful of cases that are negatively impacted, we can reconsider the heuristics.

benchmarks.run.windows.arm64.checked.mch:

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 11793820 (overridden on cmd)
Total bytes of diff: 11791688 (overridden on cmd)
Total bytes of delta: -2132 (-0.02 % of base)
diff is an improvement.
relative diff is an improvement.

coreclr_tests.pmi.windows.arm64.checked.mch:

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 121897680 (overridden on cmd)
Total bytes of diff: 121843412 (overridden on cmd)
Total bytes of delta: -54268 (-0.04 % of base)
diff is an improvement.
relative diff is an improvement.

libraries.crossgen2.windows.arm64.checked.mch:

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 48036464 (overridden on cmd)
Total bytes of diff: 48023944 (overridden on cmd)
Total bytes of delta: -12520 (-0.03 % of base)
diff is an improvement.
relative diff is an improvement.

libraries.pmi.windows.arm64.checked.mch:

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 60496980 (overridden on cmd)
Total bytes of diff: 60486588 (overridden on cmd)
Total bytes of delta: -10392 (-0.02 % of base)
diff is an improvement.
relative diff is an improvement.

libraries_tests.pmi.windows.arm64.checked.mch:

Summary of Code Size diffs:
(Lower is better)

Total bytes of base: 136917816 (overridden on cmd)
Total bytes of diff: 136898512 (overridden on cmd)
Total bytes of delta: -19304 (-0.01 % of base)
diff is an improvement.
relative diff is an improvement.

Full report: https://dev.azure.com/dnceng/public/_build/results?buildId=1724373&view=ms.vss-build-web.run-extensions-tab

@kunalspathak kunalspathak changed the title WIP Arm64: Have CpBlkUnroll and InitBlkUnroll use SIMD registers Arm64: Have CpBlkUnroll and InitBlkUnroll use SIMD registers Apr 19, 2022
@kunalspathak kunalspathak marked this pull request as ready for review April 19, 2022 05:48
@kunalspathak
Copy link
Member Author

@dotnet/jit-contrib

@kunalspathak
Copy link
Member Author

ping.

@AndyAyersMS
Copy link
Member

Improvements: dotnet/perf-autofiling-issues#4992

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants