ARM64-SVE: Allow LCLs to be of type MASK #109286
Early draft version. Some TODOs and failures on other code I've run it on. The pass probably needs renaming / moving to a different file. |
Added some preliminary questions; I would love to see the asmdiffs for the code.
@kunalspathak I generally wait with reviews until the PR is out of draft, unless @a74nh wants me to review it now? |
It seems like this would be better implemented later by making use of SSA. This is currently doing multiple IR walks which is unnecessary, and it is also not correct since nothing is verifying that the |
These early comments are useful in helping shape the direction of the work.
If that makes finding all the uses (and the parents of the uses) easier, then happy to switch.
My theory was that for most use cases (outside of Fuzzlyn), when a variable of As a first version, simply making all LCLs store as A later PR could analyse all the uses and decide which is the dominating use and optimise that way. Maybe only turn on for AVX512 at this point.
I'm more concerned about the correctness. For example, what happens for a case like
? If I'm reading the code right, strange things will happen that do not properly reflect the possibility of the "else" case. The transformation needs to behave correctly for cases like this. If you do it by making use of SSA, then you can easily know whether a use of a local that is going into |
I think I understand a bit better now after reading the code closer. For my case above, you will end up inserting I would probably suggest to shape it like this:
|
That's the typical practice, but in this case we wanted to seek early feedback before more progress is made (potentially in the wrong direction).
Can this be combined with any of the existing walks of IR? How early or late can this be done?
At what point is it ideal to do this insertion and removal? |
Probably, but I don't see a need to: these intrinsics are going to exist in very very few functions the JIT encounters, so a separate pass is going to run very rarely now that @a74nh added
Not sure that there is any one point that is strictly speaking better than others. The current position after local morph seems fine to me. |
Latest version uses hashtable as suggested. Value in the table is just a Needs a lot more commenting and tidying. |
Change-Id: Ic18f575e266d63db38f95601d374441cdbf28b44
@jakobbotsch : Consider... {
Vector<ushort> vr19 = Sve.CompareLessThanOrEqual(vr12, vr18);
var vr20 = Sve.TestAnyTrue(vr19, vr19);
Runtime_109286.M7(s_14, vr20, ref s_12, vr23, vr19);
}
[method: MethodImpl(MethodImplOptions.NoInlining)]
private static void M7(C2 argThis, bool arg0, ref Vector128<int> arg1, bool[] arg2, Vector<ushort> arg3)
{
}
Using a
And the user:
From those two, what's the generic way to parse When I have that, I want to call |
The first arg to the visit function is the edge (
It sounds like this transformation cannot be done in a local way after all: it needs to know information from the operations of the reaching definitions. The simple way would be to ensure in pass 1 that everyone agrees on the type of mask-to-vector conversion that was dropped so that you can use it when reinserting the vector-to-mask conversions in the second pass. |
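To illustrate the suggestion above, here is a minimal standalone C++ sketch (not the PR's code) of a first pass that records, per local, the element type of the conversion that was dropped, and flags the local when two definitions disagree. `ElemType`, `ConversionTypeTable`, `RecordDef` and `TypeForUse` are hypothetical names invented for this sketch; none of the real JIT types are modelled.

```cpp
#include <optional>
#include <unordered_map>

// Hypothetical element types of a mask<->vector conversion.
enum class ElemType { Byte, UShort, Int, Long };

// Hypothetical table keyed by local number.
class ConversionTypeTable
{
    // An empty optional means the definitions disagreed on the type.
    std::unordered_map<unsigned, std::optional<ElemType>> m_types;

public:
    // Pass 1: called for each definition whose conversion is dropped.
    void RecordDef(unsigned lclNum, ElemType type)
    {
        auto result = m_types.try_emplace(lclNum, type);
        if (!result.second && result.first->second != type)
        {
            result.first->second.reset(); // conflicting definitions
        }
    }

    // Pass 2: type to use when reinserting a conversion at a use.
    // Empty when the local is unknown or its definitions disagreed.
    std::optional<ElemType> TypeForUse(unsigned lclNum) const
    {
        auto it = m_types.find(lclNum);
        if (it == m_types.end())
        {
            return std::nullopt;
        }
        return it->second;
    }
};
```

A second pass would only reinsert a conversion when `TypeForUse` returns a value; a local with an empty result would be left untouched.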
Agreed. In the example I have a Is there a generic way of parsing a GenTree to look at all the args? |
assert(lclOp->gtType != TYP_MASK);
var_types lclOrigType = lclOp->gtType;
lclOp->gtType = TYP_MASK;
LclVarDsc* varDsc = m_compiler->lvaGetDesc(lclOp->GetLclNum());
varDsc->lvType = TYP_MASK;
This needs a check to skip the conversion for parameters (lvIsParam) and for OSR locals (lvIsOSRLocal). For OSR locals there may have been stores in the tier 0 version that you did not see and that you thus cannot update.
This suggests that the transformation would probably be better off avoiding the retyping and instead creating new TYP_MASK locals, updating all uses to the new locals. The required handling for parameters and OSR locals would just be to insert a single initial conversion in an initial basic block. That extra conversion could be taken into account in the heuristic.
Up to you if you want to put in the restriction or change it in this suggested way. I'd probably suggest to just put in the restriction and improve it in a follow-up if you ever run into some motivating cases.
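As a rough illustration of the restriction option, the following standalone C++ sketch shows the shape of such a check. `LocalInfo` and `CanRetypeToMask` are hypothetical stand-ins made up for this example; only the two flags named above (lvIsParam, lvIsOSRLocal) are modelled, and the real LclVarDsc is not used.

```cpp
// Hypothetical, simplified stand-in for the JIT's local descriptor.
struct LocalInfo
{
    bool isParam;    // models lvIsParam
    bool isOsrLocal; // models lvIsOSRLocal
};

// Returns true if the local may be retyped to a mask.
// Parameters arrive from the caller as vectors, and OSR locals may have
// stores in the tier 0 code that this pass never sees, so both are skipped.
bool CanRetypeToMask(const LocalInfo& local)
{
    return !local.isParam && !local.isOsrLocal;
}
```

The alternative described above (new TYP_MASK locals plus one initial conversion) would lift this restriction at the cost of a larger change.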
and instead creating new TYP_MASK locals
That would be quite a change to this PR. I would expect most uses of this pass to come from locals within a method. But, yes, it'd be fairly easy to write some tests that expose this. For now I'm happy for this to stay as is, as it should catch almost all instances of importance. We can do some investigations once there is some real SVE code out there; there are probably more important SVE performance issues to address first.
Rephrased your comment into the code as a TODO.
All review comments resolved again. However, it looks like I have some Fuzzlyn issues. Will investigate.
Looks like there is a problem:
public static void TestEntryPoint()
{
    Vector<ushort> vr0 = Vector.Create<ushort>(65534);
    bool x = Sve.TestLastTrue(vr0, vr0); // Use vr0 as a mask
    Consume(x);
    System.Console.WriteLine(vr0); // Use vr0 as a vector
}
Which is essentially:
public static void TestEntryPoint()
{
    Vector<ushort> vr0 = Vector.Create<ushort>(65534);
    bool x = Sve.TestLastTrue(ConvertVectorToMask(vr0), ConvertVectorToMask(vr0));
    Consume(x);
    System.Console.WriteLine(vr0);
}
With optimisations off, this outputs
With optimisations on, it optimises to:
public static void TestEntryPoint()
{
    mask<ushort> vr0 = ConvertVectorToMask(Vector.Create<ushort>(65534));
    bool x = Sve.TestLastTrue(vr0, vr0);
    Consume(x);
    System.Console.WriteLine(ConvertMaskToVector(vr0));
}
The I think what the pass needs to do is: if a vector is used as a vector (i.e. is used without a ConvertVectorToMask() attached), then it cannot be converted to store as a mask. The major use case this pass is trying to optimize is when a variable is created and then only used as a mask. This is still safe to do. To get the other cases, it can be done in the same way as the suggestions for parameters: create a new store and update uses accordingly. Given we expect uses switching between masks and vectors to be the uncommon case, I'm still happy to leave that as a later piece of work, probably in the spring.
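A small standalone C++ model may help show why the round trip loses data and what the proposed rule looks like. This is only a sketch: `Vec`, `Mask`, `ToMask`, `ToVector`, `UseKind` and `CanStoreAsMask` are hypothetical names, the vector is assumed to be 128 bits of ushort, and the encoding of an active lane as 1 in `ToVector` is an assumption made for the illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLanes = 8; // assumes a 128-bit vector of ushort

using Vec  = std::array<std::uint16_t, kLanes>;
using Mask = std::array<bool, kLanes>;

// Simplified model of ConvertVectorToMask: a lane is active when non-zero.
Mask ToMask(const Vec& v)
{
    Mask m{};
    for (std::size_t i = 0; i < kLanes; i++)
    {
        m[i] = (v[i] != 0);
    }
    return m;
}

// Simplified model of ConvertMaskToVector: active lanes become 1
// (assumed encoding; the original lane value cannot be recovered).
Vec ToVector(const Mask& m)
{
    Vec v{};
    for (std::size_t i = 0; i < kLanes; i++)
    {
        v[i] = m[i] ? 1 : 0;
    }
    return v;
}

// How a single use consumes the local.
enum class UseKind { AsMask, AsVector };

// Proposed rule: a local may be stored as a mask only if no use reads it
// as a plain vector.
bool CanStoreAsMask(const std::vector<UseKind>& uses)
{
    for (UseKind use : uses)
    {
        if (use == UseKind::AsVector)
        {
            return false;
        }
    }
    return true;
}
```

In this model a vector filled with 65534 does not survive `ToVector(ToMask(v))`, which mirrors the failing test: the WriteLine use reads vr0 as a vector, so `CanStoreAsMask` would reject the conversion.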
Fixed to not convert if used as vector. Added additional testing. I'll keep all the old tests that don't convert because they'll be useful later. |
LGTM!
Looks like there is a conflict, can you resolve it?
@kunalspathak can you take another look? (You are still marked as changes requested) |
Some performance figures. This was run on a Graviton 3 with the vector length reduced to 128 bits, so the figures will be a little different compared to Cobalt 100, but the magnitude of change should be similar. These routines are taken from my blog, which should be published this week, and I can point to a source repo then.
|
Thank you for sharing this. I will take a look later today. |
Added a few comments.
MaskConversionsWeight defaultWeight;
MaskConversionsWeight* weight = weightsTable->LookupPointerOrAdd(lclOp->GetLclNum(), defaultWeight);

JITDUMP("Local %s V%02d at [%06u] ", isLocalStore ? "store" : "var", lclOp->GetLclNum(),
nit:
JITDUMP("Local %s V%02d at [%06u] ", isLocalStore ? "store" : "var", lclOp->GetLclNum(),
JITDUMP("Local %s V%02d at [%06u] ", isLocalStore ? "store" : "use", lclOp->GetLclNum(),
// cannot be stored as a mask as data will be lost.
// For all of these, conversions could be done by creating a new store of type mask.
// Then uses as mask could be converted to type mask and pointed to use the new
// definition. Tbe weighting would need updating to take this into account.
// definition. Tbe weighting would need updating to take this into account.
// definition. The weighting would need updating to take this into account.
// Limitations:
//
// Local variables that are defined then immediately used just once may not be saved to a
// store. Here a convert to to vector will be used by a convert to mask. These instances will
// store. Here a convert to to vector will be used by a convert to mask. These instances will
// store. Here a convert to vector will be used by a convert to mask. These instances will
// To optimize this, the pass searches every local variable definition (GT_STORE_LCL_VAR)
// and use (GT_LCL_VAR). A weighting is calculated and kept in a hash table - one entry
// for each lclvar number. The weighting contains two values. The first value is the count of
// of every convert node for the var, each instance multiplied by the number of instructions
// of every convert node for the var, each instance multiplied by the number of instructions
// of every convert node for the var - each instance multiplied by the number of instructions
// for each lclvar number. The weighting contains two values. The first value is the count of
// of every convert node for the var, each instance multiplied by the number of instructions
// in the convert and the weighting of the block it exists in. The second value assumes the
// local var has been switched to store as a mask and performs the same count. The switch
// local var has been switched to store as a mask and performs the same count. The switch
// local var has been switched to the mask during the store and performs the similar count calculation to see what the cost of loading these "converted mask" values is back as a vector.
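The weighting described in the comment under review can be sketched in standalone C++ roughly as follows. This is not the PR's MaskConversionsWeight: `WeightSketch`, `Add` and `ShouldSwitch` are hypothetical, and because the quoted comment is cut off after "The switch", the decision rule (switch only when the cost after switching is strictly lower) is an assumption.

```cpp
// Hypothetical, simplified model of the per-local weight.
struct WeightSketch
{
    double currentCost = 0.0; // cost of the conversions present today
    double switchCost  = 0.0; // cost of the conversions needed if stored as a mask
    bool   invalid     = false;

    // Record one definition or use of the local.
    //   hasConversion: the node currently has a convert attached
    //   convertInstrs: number of instructions in that convert
    //   blockWeight:   weighting of the block containing the node
    void Add(bool hasConversion, unsigned convertInstrs, double blockWeight)
    {
        double cost = convertInstrs * blockWeight;
        if (hasConversion)
        {
            currentCost += cost; // conversion exists today, disappears after switching
        }
        else
        {
            switchCost += cost; // no conversion today, one is needed after switching
        }
    }

    // Mark the local as not convertible at all.
    void Invalidate()
    {
        invalid = true;
    }

    // Assumed decision rule: switch only when it is strictly cheaper.
    bool ShouldSwitch() const
    {
        return !invalid && switchCost < currentCost;
    }
};
```

For example, a converted use in a hot loop block (`Add(true, 2, 10.0)`) outweighs an unconverted use in a cold block (`Add(false, 2, 1.0)`), so this model would choose to store the local as a mask.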
void InvalidateWeight()
{
    JITDUMP("Invalidating weight. ");
    invalid = true;
Should we also zero out the currentCost and switchCost to make sure we don't accidentally use them for an invalidated weight?
The comments are minor, so it's OK to do a follow-up PR for them.
These are now available at https://gitlab.arm.com/blogs/sveincsharp The full blog is https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/using-sve-in-csharp |
Fixes #108241. Follow-on to the work started in #99608
SVE performance is being heavily hampered by unnecessary conversions between vector and mask.
Consider
Here the mask will be converted to a vector for storage in mask, then converted back into a mask for use in Compact. However, mask is a local variable, so there are no requirements on it outside local scope. In this case the conversions can simply be removed, and mask will be stored as a mask.
Benchmarking a simple loop which takes two arrays, multiplies each element, then sums across. With this PR, the performance of SVE improves a lot:
Output for test UseMaskAsMaskAndVector(): https://gist.github.com/a74nh/fc2111440c9fe17040508952d7ea5bd0