-
Notifications
You must be signed in to change notification settings - Fork 4.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
API Proposal: ShiftLeft, ShiftRight for Vector<T> #20510
Comments
The main conceptual problem with these methods is that, at least if our past investigations are valid, it is not feasible for the JIT to produce good SIMD instructions for these in all cases, because the second parameter is not necessarily a constant. It can do so if the methods are called with a constant parameter, but this cannot be guaranteed in C#. There is a "pit of failure" so to speak, where the performance of the method drops off a cliff if used improperly. For example, consider the following: result = ShiftLeft(v, 5); // Constant parameter
result = ShiftLeft(v, this.Settings.ShiftAmount); // Parameter from an indirect value The second call cannot be properly optimized because The most obvious workaround for this: bake specific constants into the method signature, e.g. force them to be a part of the compile-time signature: public static Vector<T> ShiftLeftOne(Vector<T> v);
public static Vector<T> ShiftLeftTwo(Vector<T> v);
public static Vector<T> ShiftLeftThree(Vector<T> v);
public static Vector<T> ShiftLeftFour(Vector<T> v);
... |
What about use of an enum? enum ShiftAmount : byte/int { One = 1, Two, Three, ... };
public static Vector<T> ShiftLeft(Vector<T> v, ShiftAmount a);
// called like:
var result = ShiftLeft(v,ShiftAmount.One);
var result = ShiftLeft(v,ShiftAmount.Two);
var result = ShiftLeft(v,ShiftAmount.Three); |
The problem is that the value cannot be guaranteed to be a compile-time constant. There is no way to guarantee that in C#, except to make the constant a part of the method signature (i.e. the name of the method). Enums don't help in that regard. |
Of course, sorry. So the C# compiler seems to have the facility to identify if something is a compile time constant or not, as evidenced by error messages on optional parameters on functions. Are there any other motivations for semantics to constrain an input parameter to be a compile time constant? Or would that definitely be a no-go? |
Could including a Roslyn analyzer in the NuGet package be a solution? That way, if you write Caveats:
|
Is NEON the only architecture where this needs to be a constant? |
Out of interest: Why doesn't the JIT support immediate operands? |
@svick Including a Roslyn analyzer is a potential option -- it's the only one we've thought of that could help here, TBH. @jackmott I'm not very familiar with NEON. The discussion above certainly applies to SSE/AVX instruction families. @mjmckp It's not that the JIT "doesn't support immediate operands". The problem is that, when writing C# code, there is no way to enforce that you call a function with an immediate operand; that just isn't a concept available in the language or type system. When the JIT encounters such a call, every option at that point is sub-optimal. The JIT could either:
Like I said above, I think the Roslyn analyzer is the best option that we've thought of. We have never shipped such an analyzer with a BCL package, though, so I'm not sure exactly what it would look like. |
@mellinoe I guess what I meant was: Why can't the JIT look at each invocation of the method, and either 1) generate a fixed instruction which contains the immediate operand when the argument is a literal value, or, 2) generate a dynamic instruction which contains the runtime value of the argument? I'm sure there's a perfectly good reason why it can't, I'm just curious! |
@mjmckp It should be able to, but the "dynamic instruction" version will be many times slower, in theory. If we hit that path, we've essentially "failed" to give you an intrinsic operation. We'd like to design something where it is obvious when you make that mistake, and how to easily fix it. |
I think there is a solution to that problem - code has to talk to compiler (Jit): [AttributeUsage(AttributeTargets.Parameter)]
internal class JitIntrinsicParamAttribute : Attribute
{
public JitIntrinsicParamAttribute(ParamType type) { Type = type; }
public IntrinsicParamType Type { get; private set; }
}
internal enum IntrinsicParamType
{
Unknown,
Immediate
}
public static Vector<T> ShiftLeft(Vector<T> x, [JitIntrinsicParam(IntrinsicParamType.Immediate)] int n); but on the other end compiler has to listen and should understand above code - without Jit changes implementation would be very difficult. Encountering above attribute compiler should implement rewriting transform which would ensure that shift operation receives argument as an immediate value disregard of what we pass to method invocation. The above pattern can be further generalized to include other difficult to implement corner cases. |
@4creators You can still pass a non-constant value to that method. The problem isn't that the JIT can't detect this case (it can), but that there is no way to gracefully recover at that point. Ideally, your code would not compile if you passed a non-constant parameter to the function, but C#/IL does not allow that in any way. |
I think using an attribute is a good idea. I would mean that:
Though, what would happen if the compiler and the analyzer produced the same error? Is there some way to avoid producing two errors in that case? |
@mellinoe My proposal essentially allows for non-constant values to be passed to intrinsic method which uses processor instruction requiring immediate operand. Intel definition of one of the shift intrinsics is as follows:
VPSLLW 128 bit Intel instruction definition is as follows:
VPSLLW 256 bit instruction definition is as follows:
Using above information one can notice that shift instructions accept not only Going further the only significant gurantee which Jit compiler has to provide is that register stores value of parameter and not it's address. In my opinion (however I do not know RyuJit internals good enough to be any authority on that) it does allow to pass a non-constant value to that instruction and the compiler (RyuJit) must ensure during all compilation passes and in particular during register allocation, instruction selection and instruction scheduling that passed value is stored in appropriate register. What is even more important based on available VPSLL*, VPSRL* we can add additional public static Vector<T> ShiftLeft(Vector<T> x, Vector<U> n);
public static Vector<T> ShiftRight(Vector<T> x, Vector<U> n);
public static operator Vector<T> <<(Vector<T> x, int n);
public static operator Vector<T> <<(Vector<T> x, Vector<U> n);
public static operator Vector<T> >>(Vector<T> x, int n);
public static operator Vector<T> >>(Vector<T> x, Vector<U> n);
public static operator Vector<T> <<=(Vector<T> x, int n);
public static operator Vector<T> <<=(Vector<T> x, Vector<U> n);
public static operator Vector<T> >>=(Vector<T> x, int n);
public static operator Vector<T> >>=(Vector<T> x, Vector<U> n); @svick I agree that attributes could be used during Roslyn compilation and by Roslyn analyzers but due to flexibility which we have on processor instruction level at least with x86 ISA this would not be required. |
After some experiments I have arrived at the following solution for enforcing passing compile time constant to basic function form, however, this would require a change to Attribute syntax. Snippet which compiles is below, unfortunately current Attribute syntax is a bit problematic here bcs one can still pass a non-constant value to ConstByteAttribute constructor (it's impossible if we use attribute inside square brackets): namespace System.Numerics.Experimental
{
[AttributeUsage(AttributeTargets.All, Inherited = false, AllowMultiple = false)]
public class ConstByteAttribute : Attribute
{
public ConstByteAttribute(byte value)
{
Value = value;
}
public byte Value
{
get;
private set;
}
}
public class Vector<T>
{
// API
public static Vector<T> ShiftLeft(Vector<T> x, ConstByteAttribute constShift)
{
throw new NotImplementedException();
}
// API usage
public static Vector<T> ShiftLeftOne(Vector<T> x)
{
return ShiftLeft(x, new ConstByteAttribute(1));
}
}
} One obvious change to C# which would be helpful for the above code would be an ability to use type parametrized attributes -> generic attributes providing that we could put on them constraints in form of value types. It would be a modification of the proposal dotnet/csharplang#569 [AttributeUsage(AttributeTargets.All, Inherited = false, AllowMultiple = false)]
public class ConstAttribute<T> : Attribute where T : byte, sbyte
{
public ConstAttribute(T value)
{
Value = value;
}
public T Value
{
get;
private set;
}
} Other changes could include implicit or explicit conversion operators form public class Vector<T>
{
// API
public static Vector<T> ShiftLeft(Vector<T> x, ConstByteAttribute constShift)
{
throw new NotImplementedException();
}
// API usage
public static Vector<T> ShiftLeftOne(Vector<T> x)
{
return ShiftLeft(x, (Const)1);
}
} Finally we have a new very useful language feature -> compile time constants (or perhaps even another class of constants -> scoped constants) which are enforced using type system (very similar to C++). |
@4creators I don't really understand what you're proposing, nor how it would solve the issue at hand. Like you have said, there is no way to ensure that someone passes a constant value to the attribute constructor, so it is no better than just passing a value in directly. I may be missing the reason you are using an I also believe you have misunderstood the behavior of the shift instructions above, but I am not an expert in that particular instruction. The documentation just seems to indicate that the shift amount is specified by the 1-byte immediate operand. Realistically, we will probably do the following:
|
@mellinoe It is misunderstending - in my first post I have used attribute to mark passed value to enable Jit support which as You have said is available:
therefore attribute use is not necessary anymore for all cases where we do not have to use constants - I thought at that time of writing that it would mean always for shift intrinsics. I have not stated that clearly in the second post discussing processor instructions. This part was edited and set in bold. What I wanted to stress in second post discussing processor instructions is that we do not have to use immediate value as an argument to shift instruction. In the x86 ISA there are almost always defined two other VPSLL*, VPSRL* instructions which use xmm register or directly memory location:
For this instruction variant operand in which one passes shift value is a
Altogether AVX2 defines 18 256bit shift instructions on words, double words and quadwords out of which only 6 require immediate operand (constant), 6 can use xmm register and 6 can use memory location. AVX defines 18 128bit shift instructions with exactly same groups as AVX2. Exception to this schema is at the level of SSE2 instructions where there are only 10 128bit two operand instructions and only 4 may use xmm or m128 memory operands. My third post, however, is an extension of argument from the first post on attribute usage but this time in a completely new context i.e. we can implement 32 Vector shift instrinsics without constants using available AVX2, AVX and SSE instructions but still there are 12 corner cases where we would need compile time constants. Actually I have noticed that we have couple of uncovered shift cases while counting x86 ISA instructions over last hour. Uncovered cases - no hardware support - comprise processors which do not support AVX/AVX2 ISA and these are pre Sandy Bridge (no AVX) Intel processors plus lower classes of even currently available processors i.e. currently produced Intel Pentium G and lower (but not Intel Pentium D). When we go lower on product ladder than even currently produced low power processor would not support SSE. |
Wouldn't the recovery option just be to use the IL implementation rather than the instrinsic? |
Yes -- that is likely an order of magnitude slower. |
Right, but it would still work. I think at some point, we have to acknowledge that Vector is there to support advanced performance scenarios, and developers using it should have a basic understanding of the implementation and restrictions. I honestly don't see how this is any worse than the current state following the addition of the Narrow/Widen/Convert methods on It started out as: And I think the 'using it correctly' restriction is kind of implicit in Vector operations anyway, because if you use them when they're not supported (for whatever reason), it's always much slower than a scalar implementation would be. |
I'm in agreement with you -- the Vector API's are for advanced users, and we can only do so much hand-holding; it is ultimately up to the user to make sure they understand what they are doing. Like I said above, and after discussing this a lot, we are leaning towards throwing an exception if a non-const value is given. I don't think that using the IL fallback path is a good idea. The JIT is able to detect when an invalid value is passed in and generate an exception instead. We will also look into a Roslyn analyzer and/or a Roslyn compiler feature that lets us specify these parameter constraints so that errors can be caught at compile time rather than runtime. |
Oh yeah, that makes perfect sense. Since a non-immediate argument is never correct, throwing an exception is probably the right thing to do. I thought the hold-up was that throwing a runtime exception wasn't acceptable and that catching it at compile time wasn't possible. |
What's wrong with non-immediate shift counts? From @4creators comment and Intel documentation it looks like all of these instructions have both |
The discussion above is just about how to handle hardware instructions which require immediate operands. If there's an instruction which doesn't take an immediate operand (and there is in this case), then there is no problem. It's just something to keep in mind when looking at these sorts of things. |
Yeah, I got mixed up there. I was thinking about shuffles because that's what I'm more interested in, and they were part of the conversation. Having thought more about it, though, I'm not sure a runtime exception would be great for those cases where an immediate is needed for the intrinsic and a variable is passed to the method. It would be possible (however unlikely) for someone to write bad code, test it with the legacy JITs and have it work with the IL implementation, only to have it fail later in a RyuJit environment. Seems like the best option is still to fall back to the slow code for consistency. |
If we include a Roslyn analyzer with the library (or land a built-in compiler feature), then the user would also have to actively ignore the build-time warning/error about misusing the API. I think I'm okay with throwing a runtime error in that case. |
I have created a C# |
I do not think that it is a good idea to throw exceptions as performance warnings when the JIT is not able to perform given optimization. There are number of cases where the slow non-interpreted implementation is perfectly acceptable. For example, profilers and other similar diagnostic tools that instrument IL can modify the IL such that the JIT may not be able to detect the constant. It is not unusual to see order of magnitude slow down in a performance sensitive methods when the expected optimization is not performed for them, nothing specific to SIMD. For example, |
@jkotas @mellinoe probably I am not understanding the root of the issue of immediate access instructions. But given that the latency for the immediate is 1 and the other case (register) is 3 in the worst case (on most relevant architectures at least), there is not such a big difference between one or the other. If the JIT can see the constant, fine, you issue the immediate instruction; if not, well you just issue the register one. The difference is for most practical purposes so small (given the alternative of not having it) that it feels like a non-issue for every AVX supported platform. With aggressive inlining set on the function itself, the JIT will constant propagate and catch most of the constant uses anyways. If you really really have an issue there, its because you know what you are doing anyways; and you can definitely fix it if possible. Even in the cases where it is just not supported, the straightforward (naive) version is instruction translate to an extra table-jump + lots of instructions + a far jump back per call; definitely not the same as having support but its doable on non AVX[1|2] without issuing a call instruction (which we should avoid like the plague). And I bet there are probably better algorithms for it anyways. I can understand the "pit of success" drive, but in the end the problem is an abstraction one. IMHO the general issue here is that probably we are not going to use this anyways in non AVX platforms (I know I wouldn't use it if I cannot know in advance). The |
😆 Seems that our discussion is misplaced because of errors in Intel Architecture @mellinoe @jkotas Perhaps it would be prudent to ask Intel for official position on that error putting behind the question MSFT weight to settle problem once for all as we have wasted quite a bit of time for non-issue.
|
I believe, for a compiler feature, adding a The runtime would still likely need to validate the parameter is a literal and throw if it is not (for manually implemented IL, or a compiler that ignores the modreq flag anyways). |
@jkotas Thanks for bringing that up; we hadn't considered such a scenario in our discussions. In general, I don't really have a strong preference for one option or the other, because I am very hopeful that we will be able to rely on a compiler feature to make the difference more-or-less irrelevant. |
Intel hardware intrinsic API proposal has been opened at dotnet/corefx#22940 |
It has been nearly 2 years. Any progress on |
You dont need it anymore. You can implement them directly using Hardware Intrinsics. 3.0 will support the whole SSE, AVX, AVX2 standard. 2.1 already support a bunch. |
This isn't much help to those constrained to use .NET framework, though. |
I'm fine with taking a proposal for @mjmckp could you please update the top post to follow our API proposal template? https://github.com/dotnet/runtime/issues/new?assignees=&labels=api-suggestion&template=02_api_proposal.md Once that is done, I can mark this as |
Any news regarding Vector, << and >> ? 😄 |
Given the OP hasn't responded, if someone wants to spend the time to write up the proposal following our template, then this can move forward. Our template is: https://github.com/dotnet/runtime/issues/new?assignees=&labels=api-suggestion&template=02_api_proposal.md The interested party can either open a new issue or post it here and I can update the original post. |
Backround and motivation
API Proposalnamespace System.Numerics
{
public static class Vector
{
public static Vector<byte> ShiftLeft(Vector<byte> x, int n);
public static Vector<short> ShiftLeft(Vector<short> x, int n);
public static Vector<int> ShiftLeft(Vector<int> x, int n);
public static Vector<long> ShiftLeft(Vector<long> x, int n);
public static Vector<nint> ShiftLeft(Vector<nint> x, int n);
public static Vector<sbyte> ShiftLeft(Vector<sbyte> x, int n);
public static Vector<ushort> ShiftLeft(Vector<ushort> x, int n);
public static Vector<uint> ShiftLeft(Vector<uint> x, int n);
public static Vector<ulong> ShiftLeft(Vector<ulong> x, int n);
public static Vector<byte> ShiftRightLogical(Vector<byte> x, int n);
public static Vector<short> ShiftRightLogical(Vector<short> x, int n);
public static Vector<int> ShiftRightLogical(Vector<int> x, int n);
public static Vector<long> ShiftRightLogical(Vector<long> x, int n);
public static Vector<nint> ShiftRightLogical(Vector<nint> x, int n);
public static Vector<sbyte> ShiftRightLogical(Vector<sbyte> x, int n);
public static Vector<ushort> ShiftRightLogical(Vector<ushort> x, int n);
public static Vector<uint> ShiftRightLogical(Vector<uint> x, int n);
public static Vector<ulong> ShiftRightLogical(Vector<ulong> x, int n);
public static Vector<short> ShiftRightArithmetic(Vector<short> x, int n);
public static Vector<int> ShiftRightArithmetic(Vector<int> x, int n);
public static Vector<long> ShiftRightArithmetic(Vector<long> x, int n);
public static Vector<nint> ShiftRightArithmetic(Vector<nint> x, int n);
public static Vector<sbyte> ShiftRightArithmetic(Vector<sbyte> x, int n);
}
} Additional NotesIt would likely be beneficial to expose The RHS of the shift operator would ideally be Variants could be exposed which support "variable shift" (that is where each element is shift by a different amount). However, this isn't available on all hardware and may be a perf pit for some users. |
|
Looks good as proposed, but we normalized to the current parameter names (value and shiftCount, respectively). namespace System.Numerics
{
public static class Vector
{
public static Vector<byte> ShiftLeft(Vector<byte> value, int shiftCount);
public static Vector<short> ShiftLeft(Vector<short> value, int shiftCount);
public static Vector<int> ShiftLeft(Vector<int> value, int shiftCount);
public static Vector<long> ShiftLeft(Vector<long> value, int shiftCount);
public static Vector<nint> ShiftLeft(Vector<nint> value, int shiftCount);
public static Vector<sbyte> ShiftLeft(Vector<sbyte> value, int shiftCount);
public static Vector<ushort> ShiftLeft(Vector<ushort> value, int shiftCount);
public static Vector<uint> ShiftLeft(Vector<uint> value, int shiftCount);
public static Vector<ulong> ShiftLeft(Vector<ulong> value, int shiftCount);
public static Vector<nuint> ShiftLeft(Vector<nuint> value, int shiftCount);
public static Vector<byte> ShiftRightLogical(Vector<byte> value, int shiftCount);
public static Vector<short> ShiftRightLogical(Vector<short> value, int shiftCount);
public static Vector<int> ShiftRightLogical(Vector<int> value, int shiftCount);
public static Vector<long> ShiftRightLogical(Vector<long> value, int shiftCount);
public static Vector<nint> ShiftRightLogical(Vector<nint> value, int shiftCount);
public static Vector<sbyte> ShiftRightLogical(Vector<sbyte> value, int shiftCount);
public static Vector<ushort> ShiftRightLogical(Vector<ushort> value, int shiftCount);
public static Vector<uint> ShiftRightLogical(Vector<uint> value, int shiftCount);
public static Vector<ulong> ShiftRightLogical(Vector<ulong> value, int shiftCount);
public static Vector<nuint> ShiftRightLogical(Vector<nuint> value, int shiftCount);
public static Vector<short> ShiftRightArithmetic(Vector<short> value, int shiftCount);
public static Vector<int> ShiftRightArithmetic(Vector<int> value, int shiftCount);
public static Vector<long> ShiftRightArithmetic(Vector<long> value, int shiftCount);
public static Vector<nint> ShiftRightArithmetic(Vector<nint> value, int shiftCount);
public static Vector<sbyte> ShiftRightArithmetic(Vector<sbyte> value, int shiftCount);
}
} |
Updated Proposal
#20510 (comment)
Original Proposal
This is a proposal of the ideas discussed in issue #20147.
There are a few SIMD operations on
System.Numerics.Vector<T>
missing that are required to implement vectorized versions oflog
,exp
,sin
, andcos
.There is a C implementation of these methods using AVX intrinsics here, which I have translated into a C# implemention using
Vector<T>
here. In the implementation, I have added fake versions of the missing intrinsics.@mellinoe suggested creating two separate issues for the missing methods (the other issue is #20509), so the purpose of this issue is to propose the addition of the following methods:
which map directly to the following AVX2 intrinsics:
respectively.
The text was updated successfully, but these errors were encountered: