
ARM64-SVE: Allow LCLs to be of type MASK #109286

Merged: 68 commits into dotnet:main, Nov 18, 2024

Conversation

a74nh (Contributor) commented Oct 28, 2024

Fixes #108241. Follows on from the work started in #99608.

SVE performance is heavily hampered by unnecessary conversions between vector and mask.

Consider:

{
  Vector<int> mask = Sve.CreateTrueMaskInt32();
  Vector<int> vect = Sve.Compact(mask, val);
}

Here the result of CreateTrueMaskInt32 is converted to a vector so it can be stored in mask, then converted back into a mask for use in Compact. However, mask is a local variable, so there are no requirements on it outside its local scope. In this case the conversions can simply be removed, and mask can be stored as a mask.

Benchmarking a simple loop that takes two arrays, multiplies corresponding elements, then sums across the result. With this PR, SVE performance improves considerably:

| Method                          | Mean       | Error     | StdDev    |
|---------------------------------|-----------:|----------:|----------:|
| MultiplyAddScalar               | 124.571 us | 0.3115 us | 0.2914 us |
| MultiplyAddNeon                 | 7.221 us   | 0.0379 us | 0.0354 us |
| MultiplyAddSve                  | 71.440 us  | 0.0153 us | 0.0143 us |
| MultiplyAddSve with mask locals | 14.487 us  | 0.0005 us | 0.0004 us |

Output for test UseMaskAsMaskAndVector(): https://gist.github.com/a74nh/fc2111440c9fe17040508952d7ea5bd0

@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Oct 28, 2024
@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Oct 28, 2024
a74nh (Contributor, Author) commented Oct 28, 2024

Early draft version. Some TODOs and failures on other code I've run it on. The pass probably needs renaming / moving to a different file.

@kunalspathak kunalspathak added the arm-sve Work related to arm64 SVE/SVE2 support label Oct 28, 2024
kunalspathak (Member) left a comment

added some preliminary questions and would love to see the asmdiffs for the code.

Review threads on src/coreclr/jit/morph.cpp (outdated, resolved)
jakobbotsch (Member) commented

@kunalspathak I generally wait with reviews until the PR is out of draft, unless @a74nh wants me to review it now?

jakobbotsch (Member) commented Oct 29, 2024

It seems like this would be better implemented later by making use of SSA. This is currently doing multiple IR walks which is unnecessary, and it is also not correct since nothing is verifying that the STORE_LCL(TYP_SIMD, ConvertMaskToVector(mask)) is dominating or reaching the uses that are being replaced.

a74nh (Contributor, Author) commented Oct 29, 2024

> @kunalspathak I generally wait with reviews until the PR is out of draft, unless @a74nh wants me to review it now?

These early comments are useful in helping shape the direction of the work.

> It seems like this would be better implemented later by making use of SSA. This is currently doing multiple IR walks which is unnecessary,

If that makes finding all the uses (and the parents of the uses) easier, then happy to switch.

> and it is also not correct since nothing is verifying that the STORE_LCL(TYP_SIMD, ConvertMaskToVector(mask)) is dominating or reaching the uses that are being replaced.

My theory was that for most use cases (outside of Fuzzlyn), when a variable of TYP_MASK is created, all uses of it will also be TYP_MASK. It would be a much rarer case where a variable of TYP_MASK is used as a vector. At least, that's the case on SVE; I'm not sure about AVX512. I'd also want to do some performance measurements to confirm this.

As a first version, simply making all LCLs store as TYP_MASK if there is a conversion would be a big win.

A later PR could analyse all the uses, decide which is the dominating use, and optimise that way. Maybe that would only be turned on for AVX512.

jakobbotsch (Member) commented Oct 29, 2024

I'm more concerned about the correctness. For example, what happens for a case like

Vector<int> mask;
if (Foo()) { mask = Sve.CreateTrueMaskInt32(); } else { mask = <something else>; }
Vector<int> vect = Sve.Compact(mask, val);

? If I'm reading the code right, strange things will happen that do not properly reflect the possibility of the "else" case.

The transformation needs to behave correctly for cases like this. If you do it by making use of SSA, then you can easily know whether a use of a local that is going into ConvertVectorToMask was actually created from one of the supported patterns, since you can just go look at the definition when you find the use of the local.

jakobbotsch (Member) commented Oct 29, 2024

I think I understand a bit better now after reading the code closer. For my case above, you will end up inserting ConvertVectorToMask in the "else" branch. So that makes the replacement possible to do in a local manner, in which case I think it is reasonable to do it after local morph.

I would probably suggest shaping it like this:

  1. Do a walk over the IR, maintaining some information for each SIMD local you see in a hash table about how many conversions you would be able to remove and how many conversions you would need to add. Note that locals that are address exposed cannot be reasoned about, so those you'll have to skip (missing in the current PR)
  2. Decide on which locals are profitable to convert, and then do another walk over the IR, converting all of these at once by removing and inserting the conversions
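
As a rough illustration of that two-pass shape (a standalone toy sketch with hypothetical names - not the JIT's actual IR walk, weights table, or types):

```cpp
#include <cstdio>
#include <unordered_map>
#include <vector>

// Toy stand-in for "a use of a SIMD local seen during the IR walk".
struct Use
{
    unsigned lclNum;
    bool     feedsConvertToMask; // retyping the local would remove this conversion
    bool     isAddressExposed;   // cannot reason about the local at all
};

// Per-local tally of conversions removed vs. conversions that must be added.
struct Weight
{
    int  removable = 0;
    int  added     = 0;
    bool invalid   = false;
};

int main()
{
    // Pass 1: walk the "IR" and gather weights per local.
    std::vector<Use> uses = {{3, true, false}, {3, true, false}, {3, false, false}, {4, false, false}};
    std::unordered_map<unsigned, Weight> weights;
    for (const Use& use : uses)
    {
        Weight& w = weights[use.lclNum];
        if (use.isAddressExposed)
        {
            w.invalid = true;
        }
        else if (use.feedsConvertToMask)
        {
            w.removable++;
        }
        else
        {
            w.added++; // a vector-typed use would need a mask-to-vector conversion inserted
        }
    }

    // Pass 2: rewrite only the locals where removing conversions wins.
    for (const auto& entry : weights)
    {
        const Weight& w       = entry.second;
        bool          convert = !w.invalid && (w.removable > w.added);
        printf("V%02u: %s\n", entry.first, convert ? "retype to TYP_MASK" : "leave as vector");
    }
    return 0;
}
```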

kunalspathak (Member) commented

> @kunalspathak I generally wait with reviews until the PR is out of draft, unless @a74nh wants me to review it now?

That's the typical practice, but in this case we wanted to seek early feedback before more progress was made (potentially in the wrong direction).

kunalspathak (Member) commented

> 1. Do a walk over the IR, maintaining some information for each SIMD local you see in a hash table about how many conversions you would be able to remove and how many conversions you would need to add. Note that locals that are address exposed cannot be reasoned about, so those you'll have to skip (missing in the current PR)

Can this be combined with any of the existing walks of IR? How early or late can this be done?

> 2. ...and then do another walk over the IR, converting all of these at once by removing and inserting the conversions

At what point is it ideal to do this insertion and removal?

jakobbotsch (Member) commented

> Can this be combined with any of the existing walks of IR? How early or late can this be done?

Probably, but I don't see a need to: these intrinsics are going to exist in very very few functions the JIT encounters, so a separate pass is going to run very rarely now that @a74nh added compConvertMaskToVectorUsed. I think it's good to keep this as a separate self-contained pass with isolated responsibilities because of that.

> At what point is it ideal to do this insertion and removal?

Not sure that there is any one point that is strictly speaking better than others. The current position after local morph seems fine to me.

a74nh (Contributor, Author) commented Oct 30, 2024

The latest version uses a hash table as suggested. The value in the table is just a signed count that increments for each use that has a convert to mask and decrements for other uses. Optimise if the value is >0.

Needs a lot more commenting and tidying.
Testing: the SVE testsuite passes. In my set of loops I have one failure to debug. Then I need to throw it at Fuzzlyn.

a74nh (Contributor, Author) commented Nov 4, 2024

@jakobbotsch: Consider...

   {
        Vector<ushort> vr19 = Sve.CompareLessThanOrEqual(vr12, vr18);
        var vr20 = Sve.TestAnyTrue(vr19, vr19);
        Runtime_109286.M7(s_14, vr20, ref s_12, vr23, vr19);
    }

    [method: MethodImpl(MethodImplOptions.NoInlining)]
    private static void M7(C2 argThis, bool arg0, ref Vector128<int> arg1, bool[] arg2, Vector<ushort> arg3)
    {
    }

We definitely want to keep vr19 as a mask - CompareLessThanOrEqual() returns a mask and TestAnyTrue() uses masks. That means, when this optimisation turns everything into masks, it needs to insert a ConvertMaskToVector() when passing the final arg of M7().

Using a GenTreeVisitor I can get hold of the local var:

               [000057] -----------                         *  LCL_VAR   simd16<System.Numerics.Vector`1> V03 loc3         

And the user:

               [000058] --C-G------                         *  CALL      void   Runtime_109286:M7(C2,ubyte,byref,ubyte[],System.Numerics.Vector`1[ushort])
               [000053] n---G------ arg0                    +--*  IND       ref   
               [000052] H----------                         |  \--*  CNS_INT(h) long   0xfbb5d80001f8 static Fseq[s_14]
               [000054] ----------- arg1                    +--*  LCL_VAR   int    V04 loc4         
               [000055] H---------- arg2                    +--*  CNS_INT(h) long   0xfbf634a02d10 static Fseq[s_12]
               [000056] ----------- arg3                    +--*  LCL_VAR   ref    V05 loc5         
               [000057] ----------- arg4                    \--*  LCL_VAR   simd16<System.Numerics.Vector`1> V03 loc3   

From those two, what's the generic way to parse the user to find which op points to the local?

When I have that, I want to call gtNewSimdCvtMaskToVectorNode(). This requires the simdBaseJitType and simdSize for the local var. Where can I get that information? The lcl var isn't a hardware intrinsic, so I can't use GetSimdBaseJitType() and GetSimdSize().

jakobbotsch (Member) commented

> From those two, what's the generic way to parse the user to find which op points to the local?

The first arg to the visit function is the edge (GenTree**); changes should be made through it.
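
As a toy illustration of why the edge is the thing to write through (a standalone sketch, not the real GenTreeVisitor API): the double pointer addresses the parent's operand slot itself, so the replacement works the same whichever argument position the local happens to occupy.

```cpp
#include <cstdio>

// Toy node: just a name, standing in for a GenTree.
struct Node
{
    const char* name;
};

// The visitor hands back the edge (a pointer to the parent's operand slot),
// so wrapping or replacing a node is done by writing through that slot.
void ReplaceThroughEdge(Node** edge, Node* replacement)
{
    *edge = replacement;
}

int main()
{
    Node lcl     = {"LCL_VAR V03"};
    Node wrapped = {"ConvertMaskToVector(LCL_VAR V03)"};

    // A toy "call" whose operands are just an array of edges.
    Node* callArgs[5] = {&lcl, &lcl, &lcl, &lcl, &lcl};

    // Wrap only the final argument, via its edge; the other uses are untouched.
    ReplaceThroughEdge(&callArgs[4], &wrapped);

    for (int i = 0; i < 5; i++)
    {
        printf("arg%d = %s\n", i, callArgs[i]->name);
    }
    return 0;
}
```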

> When I have that, I want to call gtNewSimdCvtMaskToVectorNode(). This requires the simdBaseJitType and simdSize for the local var. Where can I get that information? The lcl var isn't a hardware intrinsic, so I can't use GetSimdBaseJitType() and GetSimdSize().

It sounds like this transformation cannot be done in a local way after all: it needs to know information from the operations of the reaching definitions. The simple way would be to ensure in pass 1 that everyone agrees on the type of mask-to-vector conversion that was dropped so that you can use it when reinserting the vector-to-mask conversions in the second pass.
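
One possible way to carry that information (a standalone sketch; the names here are made up, and in the real pass this would live alongside the per-local weights): record the element type and SIMD size of each dropped conversion in pass 1, check that all definitions of the local agree, and reuse it when pass 2 reinserts a mask-to-vector conversion at a vector-typed use.

```cpp
#include <cassert>
#include <cstdio>
#include <unordered_map>

// Toy stand-ins for the JIT's simdBaseJitType / simdSize pair.
enum class BaseType
{
    Int32,
    UInt16
};

struct DroppedConversion
{
    bool     seen     = false;
    BaseType baseType = BaseType::Int32;
    unsigned simdSize = 0;
};

std::unordered_map<unsigned, DroppedConversion> conversionInfo;

// Pass 1: called for each ConvertMaskToVector removed from a store to lclNum.
void RecordDroppedConversion(unsigned lclNum, BaseType baseType, unsigned simdSize)
{
    DroppedConversion& info = conversionInfo[lclNum];
    // Every definition must agree on the conversion type,
    // otherwise the local is not a safe candidate.
    assert(!info.seen || (info.baseType == baseType && info.simdSize == simdSize));
    info.seen     = true;
    info.baseType = baseType;
    info.simdSize = simdSize;
}

// Pass 2: a vector-typed use of the retyped local gets the conversion reinserted.
void ReinsertConversion(unsigned lclNum)
{
    const DroppedConversion& info = conversionInfo.at(lclNum);
    assert(info.seen);
    printf("V%02u: insert mask-to-vector conversion, simdSize=%u\n", lclNum, info.simdSize);
}

int main()
{
    RecordDroppedConversion(3, BaseType::UInt16, 16);
    ReinsertConversion(3);
    return 0;
}
```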

a74nh (Contributor, Author) commented Nov 5, 2024

> From those two, what's the generic way to parse the user to find which op points to the local?

> The first arg to the visit function is the edge (GenTree**); changes should be made through it.

Agreed. In the example I have a GT_CALL node and the node I want to convert is arg5. In another example I might have a GT_HWINTRINSIC where the node I want to convert is arg1, or arg2.

Is there a generic way of parsing a GenTree to look at all the args?

Comment on lines +300 to +304
assert(lclOp->gtType != TYP_MASK);
var_types lclOrigType = lclOp->gtType;
lclOp->gtType = TYP_MASK;
LclVarDsc* varDsc = m_compiler->lvaGetDesc(lclOp->GetLclNum());
varDsc->lvType = TYP_MASK;
Member commented:

This needs a check to skip the conversion for parameters (lvIsParam) and for OSR locals (lvIsOSRLocal). For OSR locals there may have been stores in the tier 0 version that you did not see and that you thus cannot update.

This suggests that the transformation would probably be better off by avoiding the retyping and instead creating new TYP_MASK locals, updating all uses to the new locals. The required handling for parameters and OSR locals would just be to insert a single initial conversion in an initial basic block. That extra conversion could be taken into account in the heuristic.

Up to you if you want to put in the restriction or change it in this suggested way. I'd probably suggest to just put in the restriction and improve it in a follow-up if you ever run into some motivating cases.
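
As a sketch, the simple restriction might look roughly like this (toy flags standing in for lvIsParam, lvIsOSRLocal, and address exposure - not the real LclVarDsc):

```cpp
#include <cstdio>

// Toy mirror of the per-local flags that matter for this restriction.
struct LocalInfo
{
    bool isParam;
    bool isOsrLocal;
    bool isAddressExposed;
};

// A local can only be retyped in place when the pass can see every definition;
// parameters and OSR locals have definitions (caller / tier-0 frame) it never sees.
bool CanRetypeToMask(const LocalInfo& lcl)
{
    return !lcl.isParam && !lcl.isOsrLocal && !lcl.isAddressExposed;
}

int main()
{
    LocalInfo ordinaryLocal = {false, false, false};
    LocalInfo osrLocal      = {false, true, false};

    printf("ordinary local: %s\n", CanRetypeToMask(ordinaryLocal) ? "candidate" : "skip");
    printf("OSR local:      %s\n", CanRetypeToMask(osrLocal) ? "candidate" : "skip");
    return 0;
}
```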

a74nh (Contributor, Author) replied:

> and instead creating new TYP_MASK locals

That would be quite a change to this PR. I would expect most uses of this pass to come from locals within a method. But, yes, it'd be fairly easy to write some tests that expose this. For now I'm happy for this to stay as is, as it should catch almost all instances of importance. We can do some investigations once there is some real SVE code out there - there are probably more important SVE performance issues to do first.

Rephrased your comment into the code as a TODO.

a74nh (Contributor, Author) commented Nov 13, 2024

All review comments resolved again. However, it looks like I have some Fuzzlyn issues. Will investigate.

a74nh (Contributor, Author) commented Nov 13, 2024

Looks like there is a problem:

    public static void TestEntryPoint()
    {
        Vector<ushort> vr0 = Vector.Create<ushort>(65534);
        bool x = Sve.TestLastTrue(vr0, vr0); // Use vr0 as a mask
        Consume(x);
        System.Console.WriteLine(vr0); // Use vr0 as a vector
    }

Which is essentially:

    public static void TestEntryPoint()
    {
        Vector<ushort> vr0 = Vector.Create<ushort>(65534);
        bool x = Sve.TestLastTrue(ConvertVectorToMask(vr0), ConvertVectorToMask(vr0));
        Consume(x);
        System.Console.WriteLine(vr0);
    }

With optimisations off, this outputs <65534, 65534, 65534, 65534, 65534, 65534, 65534, 65534>

With optimisations on, it optimises to:

    public static void TestEntryPoint()
    {
        mask<ushort> vr0 = ConvertVectorToMask(Vector.Create<ushort>(65534));
        bool x = Sve.TestLastTrue(vr0, vr0);
        Consume(x);
        System.Console.WriteLine(ConvertMaskToVector(vr0));
    }

The ConvertVectorToMask turns those 65534 values into single bits which get stored in the local. The program outputs: <1, 1, 1, 1, 1, 1, 1, 1>

I think what the pass needs to do is: if a vector is used as a vector (i.e. used without a ConvertVectorToMask() attached), then it cannot be converted to store as a mask.

The major use case this pass is trying to optimize is when a variable is created and then only used as a mask. This is still safe to do.

The other cases could be handled in the same way as was suggested for parameters - create a new store and update uses accordingly. Given we expect switching between masks and vectors to be the uncommon case, I'm still happy to leave that as a later piece of work - probably in the spring.
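
That rule, sketched standalone (hypothetical names; the real pass tracks this in its weights table): a single use of the local that is not wrapped in a ConvertVectorToMask() disqualifies it, so only locals used purely as masks get retyped.

```cpp
#include <cstdio>
#include <vector>

// Toy stand-in for one use of a candidate local.
struct Use
{
    bool feedsConvertVectorToMask; // true: used as a mask, false: used as a plain vector
};

// Retyping is only safe when every use is a mask use; a plain vector use
// (like printing vr0 above) would otherwise observe the lossy mask bits.
bool CanStoreAsMask(const std::vector<Use>& usesOfLocal)
{
    for (const Use& use : usesOfLocal)
    {
        if (!use.feedsConvertVectorToMask)
        {
            return false;
        }
    }
    return true;
}

int main()
{
    // vr0 in the example above: mask uses in TestLastTrue, plus one plain vector use in WriteLine.
    std::vector<Use> vr0Uses = {{true}, {true}, {false}};
    printf("vr0: %s\n", CanStoreAsMask(vr0Uses) ? "store as mask" : "keep as vector");
    return 0;
}
```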

a74nh (Contributor, Author) commented Nov 13, 2024

Fixed to not convert if used as a vector. Added additional testing. I'll keep all the old tests that don't convert, because they'll be useful later.

jakobbotsch (Member) left a comment

LGTM!

Looks like there is a conflict, can you resolve it?


jakobbotsch (Member) commented

@kunalspathak can you take another look? (You are still marked as changes requested)

a74nh (Contributor, Author) commented Nov 18, 2024

Some performance figures. This was run on a Graviton 3 with the vector length reduced to 128 bits, so figures will be a little different compared to Cobalt 100, but the magnitude of change should be similar. These routines are taken from my blog, which should be published this week; I can point to a source repo then.

MultiplyAddSve 71.289 us -> 14.480 us.  507%
AddExtendSve 146.75 us -> 30.54 us.     486%
PartitionSve 332.8 us -> 212.7 us.      156%
StrlenSve 66.031 us -> 56.665 us.       117%

kunalspathak (Member) commented

> Some performance figures.

Thank you for sharing this. I will take a look later today.

kunalspathak (Member) left a comment

Added a few comments.

MaskConversionsWeight defaultWeight;
MaskConversionsWeight* weight = weightsTable->LookupPointerOrAdd(lclOp->GetLclNum(), defaultWeight);

JITDUMP("Local %s V%02d at [%06u] ", isLocalStore ? "store" : "var", lclOp->GetLclNum(),
Member commented:

nit:

Suggested change:
- JITDUMP("Local %s V%02d at [%06u] ", isLocalStore ? "store" : "var", lclOp->GetLclNum(),
+ JITDUMP("Local %s V%02d at [%06u] ", isLocalStore ? "store" : "use", lclOp->GetLclNum(),

// cannot be stored as a mask as data will be lost.
// For all of these, conversions could be done by creating a new store of type mask.
// Then uses as mask could be converted to type mask and pointed to use the new
// definition. Tbe weighting would need updating to take this into account.
Member commented:

Suggested change:
- // definition. Tbe weighting would need updating to take this into account.
+ // definition. The weighting would need updating to take this into account.

// Limitations:
//
// Local variables that are defined then immediately used just once may not be saved to a
// store. Here a convert to to vector will be used by a convert to mask. These instances will
Member commented:

Suggested change:
- // store. Here a convert to to vector will be used by a convert to mask. These instances will
+ // store. Here a convert to vector will be used by a convert to mask. These instances will

// To optimize this, the pass searches every local variable definition (GT_STORE_LCL_VAR)
// and use (GT_LCL_VAR). A weighting is calculated and kept in a hash table - one entry
// for each lclvar number. The weighting contains two values. The first value is the count of
// of every convert node for the var, each instance multiplied by the number of instructions
Member commented:

Suggested change:
- // of every convert node for the var, each instance multiplied by the number of instructions
+ // of every convert node for the var - each instance multiplied by the number of instructions

// for each lclvar number. The weighting contains two values. The first value is the count of
// of every convert node for the var, each instance multiplied by the number of instructions
// in the convert and the weighting of the block it exists in. The second value assumes the
// local var has been switched to store as a mask and performs the same count. The switch
Member commented:

Suggested change:
- // local var has been switched to store as a mask and performs the same count. The switch
+ // local var has been switched to the mask during the store and performs the similar count calculation to see what the cost of loading these "converted mask" values is back as a vector.

void InvalidateWeight()
{
JITDUMP("Invalidating weight. ");
invalid = true;
Member commented:

Should we also zero out the currentCost and switchCost to make sure we don't accidentally use them for an invalidated weight?

kunalspathak (Member) left a comment

Comments are minor, so it's OK to do a follow-up PR for them.

kunalspathak (Member) commented Nov 18, 2024

Probably a general comment: we should avoid doing the optimization if we are interfering with a call, in which case the predicate register has to be spilled and reloaded, adding memory access overhead.


Need to think about this as a follow-up PR.

kunalspathak merged commit e597dc2 into dotnet:main Nov 18, 2024
114 checks passed
a74nh deleted the storelcl_github branch November 19, 2024 11:58
a74nh (Contributor, Author) commented Nov 20, 2024

> Some performance figures. This was run on a Graviton 3 with the vector length reduced to 128 bits, so figures will be a little different compared to Cobalt 100, but the magnitude of change should be similar. These routines are taken from my blog, which should be published this week; I can point to a source repo then.

These are now available at https://gitlab.arm.com/blogs/sveincsharp

The full blog is https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/using-sve-in-csharp

Labels
arch-arm64
area-CodeGen-coreclr (CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI)
arm-sve (Work related to arm64 SVE/SVE2 support)
community-contribution (Indicates that the PR has been added by a community member)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ARM64 SVE: mask conversion not always optimised away
5 participants