Overview
This PR demonstrates a few things which I think are useful learnings (and it optimizes several cases below that ARM typically had optimized and where ARM beat us in the microbenchmarks):

- The creation of a table-driven intrinsic, e.g. `NI_AVX512DQ_VL_ConvertToDouble128`, which allows mapping a method (though in this case there are no frontend APIs added) plus an argument type (for example, `Vector128<long>`) to a specific hardware instruction, `vcvtps2udq`.
- The mapping of a vector method to a named intrinsic, for example `Vector128.ConvertToDouble(Vector128<ulong> v)` to `NI_AVX512DQ_VL_ConvertToDouble128`.
- The mapping of a variable-length vector method to a named intrinsic, for example `Vector.ConvertToDouble(Vector<ulong> v)` to `NI_AVX512DQ_VL_ConvertToDouble128`.
- The addition of several new `AVX512` instructions to the instruction tables.

Not all intrinsics must map to a single instruction --- several will actually lower to a series of instructions, which we'd probably call "codegen" rather than a hardware intrinsic, though those terms are loosely defined. A sketch of what these mappings look like from the managed side follows below.
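As a rough illustration (not code from this PR), the managed surface these mappings target looks like the following. The method names and types are the existing `System.Runtime.Intrinsics` and `System.Numerics` APIs; the comments about which named intrinsic and instruction the JIT picks are assumptions based on the mappings described above.

```csharp
using System.Numerics;
using System.Runtime.Intrinsics;

public static class ConvertToDoubleExamples
{
    // Fixed-width vector: with this change, the JIT can recognize this call as a
    // named intrinsic (e.g. NI_AVX512DQ_VL_ConvertToDouble128) and, when
    // AVX512DQ+VL is available, lower it to a single conversion instruction
    // instead of a longer multi-instruction sequence (assumption based on the
    // mapping described in the list above).
    public static Vector128<double> Fixed(Vector128<ulong> v)
        => Vector128.ConvertToDouble(v);

    // Variable-length vector: the same named intrinsic can back Vector<T>,
    // so the lowering is shared between the two APIs.
    public static Vector<double> VariableLength(Vector<ulong> v)
        => Vector.ConvertToDouble(v);
}
```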
Optimized Cases
Cases that have been optimized to use one of the single AVX512VL instructions...
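As a hypothetical illustration of the shape of such a case (not taken from the PR's benchmark list), consider a conversion loop over a buffer, where each iteration can now lower to a single AVX512DQ+VL conversion instruction on capable hardware instead of a longer sequence:

```csharp
using System.Numerics;

public static class OptimizedCaseSketch
{
    // Hypothetical micro-benchmark shape: convert a buffer of ulongs to doubles
    // one Vector<T> at a time. With the new mapping, Vector.ConvertToDouble can
    // lower to a single conversion instruction per iteration (assumption).
    public static void Convert(ulong[] src, double[] dst)
    {
        int i = 0;
        for (; i <= src.Length - Vector<ulong>.Count; i += Vector<ulong>.Count)
        {
            Vector<ulong> v = new Vector<ulong>(src, i);
            Vector.ConvertToDouble(v).CopyTo(dst, i);
        }

        // Scalar tail for any remaining elements.
        for (; i < src.Length; i++)
            dst[i] = (double)src[i];
    }
}
```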