
[mono][interp] Add intrinsics for common Vector128 operations #81782

Merged (9 commits) on Feb 28, 2023

Conversation

@BrzVlad (Member) commented Feb 7, 2023

We add intrinsics for the Vector128 operations that are actively used within our BCL.

We declare a set of simd method names in simd-methods.def, the same way we do it for the jit. In transform-simd.c we look up method names in the list of supported intrinsics for Vector128 and Vector128&lt;T&gt;. Once we find a supported intrinsic, we generate code for it, typically a MINT_SIMD_INTRINS_* opcode. In order to avoid adding too many new opcodes to the interpreter, simd intrinsics are grouped by signature. So all simd intrinsics that receive a single argument and return a value are called through MINT_SIMD_INTRINS_P_P. This instruction carries an index that is used to fetch the intrinsic implementation and call it indirectly.
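To make the signature grouping concrete, here is a small standalone sketch (illustrative only, not the PR's actual code; the table contents and helper names are invented) of how all "one vector in, one vector out" intrinsics can be funneled through a single opcode plus an index:

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t bytes [16]; } v128_t;

/* All intrinsics that take one vector and return one vector share this
   C signature, so a single interpreter opcode can dispatch any of them. */
typedef void (*simd_p_p_fn) (void *res, const void *arg);

static void
interp_v128_not (void *res, const void *arg)
{
	v128_t a, r;
	memcpy (&a, arg, sizeof (a));
	for (int i = 0; i < 16; i++)
		r.bytes [i] = (uint8_t)~a.bytes [i];
	memcpy (res, &r, sizeof (r));
}

static void
interp_v128_i1_negate (void *res, const void *arg)
{
	v128_t a, r;
	memcpy (&a, arg, sizeof (a));
	for (int i = 0; i < 16; i++)
		r.bytes [i] = (uint8_t)(-(int8_t)a.bytes [i]);
	memcpy (res, &r, sizeof (r));
}

/* One table per signature group; the transform records an index into it
   when it recognizes a supported method. */
static const simd_p_p_fn simd_p_p_table [] = {
	interp_v128_not,        /* index 0 */
	interp_v128_i1_negate,  /* index 1 */
};

/* What a MINT_SIMD_INTRINS_P_P handler boils down to at execution time:
   fetch the function pointer by index and call it on the operand slots. */
static void
exec_simd_intrins_p_p (void *dreg, const void *sreg, uint16_t index)
{
	simd_p_p_table [index] (dreg, sreg);
}
```

With this shape, adding a new intrinsic of the same signature costs a table entry rather than a new interpreter opcode.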

Some of the intrinsics are implemented using the standard vector extensions supported by gcc and clang. These do not fully expose the SIMD capabilities of the hardware, so some intrinsics are implemented naively. This should still be faster than a non-vectorized approach from managed code. In the future we can add better implementations on platforms where we have low-level support; this would both be faster and reduce code size.
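As a sketch of the two implementation styles mentioned above (the function and type names here are mine, not the PR's), an element-wise addition can use the gcc/clang generic vector extensions where available and fall back to a plain scalar loop elsewhere:

```c
#include <stdint.h>
#include <string.h>

#if defined(__GNUC__) || defined(__clang__)
/* gcc/clang generic vector extensions: the compiler picks suitable SIMD
   instructions for the target, or open-codes the operation. */
typedef int32_t v128_i4 __attribute__ ((vector_size (16)));

static void
v128_i4_add (void *res, const void *a, const void *b)
{
	v128_i4 va, vb, vr;
	memcpy (&va, a, sizeof (va));
	memcpy (&vb, b, sizeof (vb));
	vr = va + vb;                      /* element-wise add */
	memcpy (res, &vr, sizeof (vr));
}
#else
/* Naive fallback: still beats doing the same loop in IL on the interpreter,
   because it runs as one interpreter opcode instead of many. */
static void
v128_i4_add (void *res, const void *a, const void *b)
{
	int32_t va [4], vb [4], vr [4];
	memcpy (va, a, sizeof (va));
	memcpy (vb, b, sizeof (vb));
	for (int i = 0; i < 4; i++)
		vr [i] = va [i] + vb [i];
	memcpy (res, vr, sizeof (vr));
}
#endif
```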

Vectorized algorithms that I've tested on the interpreter tend to be about 1.5x-2x faster than their non-vectorized versions. On wasm, the jiterpreter handles non-vectorized algorithms pretty well (since they are typically a set of simple instructions that all get compiled into a single trace), so this PR doesn't produce a clear perf improvement there. Wasm should get an additional speedup once SIMD support is added to the jiterpreter.

@ghost commented Feb 7, 2023

Tagging subscribers to this area: @BrzVlad, @kotlarmilos
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: BrzVlad
Assignees: BrzVlad
Labels: area-Codegen-Interpreter-mono
Milestone: -

src/mono/mono/mini/interp/interp-simd.c (review thread resolved)
src/mono/mono/mini/interp/transform-simd.c (review thread resolved)
src/mono/mono/mini/interp/transform.h (review thread resolved)
@BrzVlad (Member, Author) commented Feb 15, 2023

/azp run runtime-extra-platforms

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@kg (Member) commented Feb 16, 2023

I don't have a full enough grip on this PR yet to check it off, but it seems promising so far. I have a basic jiterpreter implementation for it such that there are no trace aborts due to SIMD opcodes.

@lambdageek (Member)

This is a lot of boilerplate that is very similar to what we do for the JIT/AOT. Is there some way we can share more between the execution engines? There's a ton of these SIMD opcodes; it would be better if we could generate the code for each execution engine from a single abstract template. Maybe a python script or something...

/cc @jandupej

@tannergooding (Member) commented Feb 16, 2023

Is there some way we can share more between the execution engines?

RyuJIT handles this via a single header file per architecture: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsiclistxarch.h
https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsiclistarm64.h

And a general header that defines the enums + basic structs + basic helper APIs: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/hwintrinsic.h

We have an entry per intrinsic that tracks most of the table-driven information, plus a few flags to help cover anything that needs special handling or lookups (which happens through well-defined helper methods/etc).

I imagine a very similar setup would work for JitIntrinsics -> LlvmIntrinsics or for JitIntrinsics -> MonoInstructions

Mono being C only somewhat restricts the ability for some of this to be shared between RyuJIT and Mono, but that might be fine overall.
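For reference, the table-driven pattern being described usually boils down to the X-macro technique that .def files like simd-methods.def already rely on: one shared list, expanded differently by each backend. A schematic sketch follows; the list contents, column layout and opcode names are invented for illustration, not taken from either codebase.

```c
/* One shared list; in practice it would live in its own .def file the way
   simd-methods.def does today.  Columns: id, managed method name, and the
   operation each backend wants to map it to. */
#define SHARED_SIMD_INTRINS_LIST(X) \
	X (V128_ONES_COMPLEMENT, "op_OnesComplement", INTERP_SIMD_NOT, JIT_SIMD_NOT) \
	X (V128_ADD,             "op_Addition",       INTERP_SIMD_ADD, JIT_SIMD_ADD)

/* Hypothetical backend operations, only so the sketch compiles. */
enum { INTERP_SIMD_NOT, INTERP_SIMD_ADD, JIT_SIMD_NOT, JIT_SIMD_ADD };

/* Interpreter expansion: managed method name -> interpreter operation. */
#define AS_INTERP_ENTRY(id, method, interp_op, jit_op) { method, interp_op },
static const struct { const char *method; int op; } interp_simd_table [] = {
	SHARED_SIMD_INTRINS_LIST (AS_INTERP_ENTRY)
};
#undef AS_INTERP_ENTRY

/* The JIT/LLVM side would expand the same list with a different X macro
   that picks the jit_op column, so the backends cannot drift apart. */
```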

@jandupej (Member)

This is a lot of boilerplate that is very similar to what we do for the JIT/AOT. Is there some way we can share more between the execution engines? There's a ton of these SIMD opcodes; it would be better if we could generate the code for each execution engine from a single abstract template. Maybe a python script or something...

/cc @jandupej

I agree that would be beneficial. It will take some tuning to get this tabular approach right and applicable to mini, LLVM and the interpreter. So my plan is to start small and do it with a single category of operations first (e.g. comparisons). When the final form is ironed out, we will convert the existing code to the new form.

@kg (Member) commented Feb 17, 2023

Shelving the jiterpreter bits for now, but they're at kg@a38fa17.

@BrzVlad (Member, Author) commented Feb 20, 2023

There are no issues that I'm aware of with this PR. It needs a final approval so we can get it in and get more perf numbers.

These intrinsics are not yet implemented in the jiterpreter, making it slightly slower there instead.
v128_create receives every element of the vector as a separate argument. This method is typically used with constants; for a Vector128&lt;short&gt; this means that creating a constant vector requires 8 ldc.i4 instructions plus a v128_create. We can instead use a single instruction and embed the vector value directly in the code stream.
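A minimal sketch of that constant-embedding idea (the types, names and folding condition below are hypothetical simplifications, not the PR's actual transform code): when every argument of Vector128&lt;short&gt;.Create is a constant load, the arguments can be folded into a 16-byte payload that a single opcode embeds in the instruction stream.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified IR node: either "load an int constant" or not. */
typedef struct {
	bool is_ldc_i4;
	int32_t value;
} SimpleIns;

/* Try to fold Vector128<short>.Create(a, ..., h) whose eight arguments are
   all constant loads into one 16-byte blob, so the transform can emit a
   single "load vector constant" instruction with the blob inline. */
static bool
try_fold_v128_create_i2 (const SimpleIns *args, uint8_t out_blob [16])
{
	for (int i = 0; i < 8; i++) {
		if (!args [i].is_ldc_i4)
			return false;          /* not a constant vector, keep the call */
		int16_t elem = (int16_t)args [i].value;
		memcpy (out_blob + i * sizeof (elem), &elem, sizeof (elem));
	}
	return true;                       /* caller emits one opcode + payload */
}
```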
It is actually not used in the BCL, it is not really vectorized on any platform, and the codegen for the interp implementation is massive and inefficient.
@lambdageek (Member) left a comment

  1. I think we should disable this on WASI
  2. Why can't we use __builtin_shuffle(vec1, mask)?

@@ -101,6 +102,10 @@ typedef enum {

#define PROFILE_INTERP 0

#if !HOST_BROWSER && __GNUC__
Member

Suggested change
#if !HOST_BROWSER && __GNUC__
#if !HOST_BROWSER && !HOST_WASI && __GNUC__

Member Author

The real reason for disabling this on browser is not that this PR has problems with wasm, but that it leads to jiterpreter trace aborts since support is not yet added there. On WASI there is no jiterpreter support, so this PR should still bring perf gains.


// Shuffle

#define V128_SHUFFLE(eltype, itype) do { \
Member

Curious why you have to do this by hand and not with __builtin_shuffle?

Member Author

I simply didn't want to make use of any additional builtins for now, to keep it as simple as possible and reduce potential compatibility issues. I will start adding various builtins in future PRs, probably even arch-specific ones. I think __builtin_shufflevector has better support and is very powerful, so it might be a better candidate.
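For context on the two builtins (a standalone C sketch, not mono code and not the actual V128_SHUFFLE body): gcc's __builtin_shuffle accepts a runtime mask vector, while __builtin_shufflevector (clang, and gcc 12+) requires compile-time constant indices, so it suits fixed permutations rather than a runtime mask. A hand-rolled version is just a loop over the index vector.

```c
#include <stdint.h>
#include <string.h>

typedef uint8_t v16u8 __attribute__ ((vector_size (16)));

/* Hand-rolled byte shuffle over a runtime index vector; out-of-range
   indices produce zero in this sketch. */
static void
v128_shuffle_by_hand (void *res, const void *vec, const void *indices)
{
	uint8_t v [16], idx [16], r [16];
	memcpy (v, vec, 16);
	memcpy (idx, indices, 16);
	for (int i = 0; i < 16; i++)
		r [i] = idx [i] < 16 ? v [idx [i]] : 0;
	memcpy (res, r, 16);
}

#if defined(__GNUC__) && !defined(__clang__)
/* gcc only: __builtin_shuffle takes the mask as a runtime vector value. */
static v16u8
v128_shuffle_builtin (v16u8 vec, v16u8 mask)
{
	return __builtin_shuffle (vec, mask);
}
#endif
/* __builtin_shufflevector (clang, gcc 12+) needs constant indices, e.g.
   __builtin_shufflevector (v, v, 15, 14, ..., 0) for a fixed byte reversal. */
```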
