
[WIP] Stop using LIST nodes for SIMD operand lists #1141

Closed
wants to merge 14 commits into from

Conversation


@mikedn mikedn commented Dec 24, 2019

@Dotnet-GitSync-Bot Dotnet-GitSync-Bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Dec 24, 2019
@mikedn mikedn force-pushed the simd-no-list branch 2 times, most recently from 782b105 to 7c30d0d Compare December 26, 2019 20:08
@mikedn mikedn commented Dec 26, 2019

@CarolEidt @sandreenko @echesakovMSFT

This completes the removal of LIST by changing the way SIMD and HWINTRINSIC operands are stored. I still have a few things to double-check, and this also has to be split into multiple PRs (I'd say 3 - one for SIMD, one for HWINTRINSIC and one for LIST leftovers removal), but I'm interested in some early feedback about the way operands are stored.

For CALL, PHI and FIELD_LIST I continued using linked lists to store operands, partly because sometimes we need to insert new operands and partly because accessing operands by index is uncommon. For SIMD/HWINTRINSIC the situation is exactly the opposite - no need to insert new operands and operand access by index is rather common - so an array of operands looks like the better choice.
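
To make the trade-off concrete, here's a minimal standalone sketch (not the actual JIT code; all names here are hypothetical) contrasting the two storage strategies:

```cpp
#include <cassert>

struct Node { int id; };

// Linked-list style (CALL, PHI, FIELD_LIST): inserting a new operand is
// cheap, but accessing the i-th operand walks the chain.
struct ListUse
{
    Node*    node;
    ListUse* next;
};

Node* GetListOp(ListUse* head, unsigned index)
{
    while (index-- != 0)
    {
        head = head->next; // O(index) walk
    }
    return head->node;
}

// Array style (SIMD, HWINTRINSIC): access by index is O(1), but inserting
// a new operand would require reallocating the array.
struct ArrayUses
{
    Node**   uses;
    unsigned count;

    Node* GetOp(unsigned index) const
    {
        assert(index < count);
        return uses[index]; // O(1) access
    }
};
```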

We have space for 3 operands inside the SIMD node itself; that's good because it covers 99% of intrinsic needs. For 4 or more operands (e.g. new Vector4(1, 2, 3, 4)) all the operands are stored in a separately allocated array, and a pointer to this array is stored inside the node in place of the 3-operand "inline" array:

union {
    Use  m_inlineUses[3];
    Use* m_uses;
};

For some HW intrinsics the number of operands isn't fixed and in the current implementation it is computed by counting the number of nodes in the list. That doesn't work in the array implementation so we also need to store the number of operands in the node.

uint16_t m_numOps;

With these, the operands can be accessed using the following API:

unsigned GetNumOps();
GenTree* GetOp(unsigned index);
void SetOp(unsigned index, GenTree* node);
UseArray Uses(); // range-based for loop support
Use& GetUse(unsigned index);
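
For illustration, here's a self-contained sketch of how such an API could sit on top of the inline/out-of-line union described above. This is a hypothetical standalone version, not the PR's actual GenTreeJitIntrinsic code:

```cpp
#include <cassert>
#include <cstdint>

struct GenTree { int oper; };

struct Use
{
    GenTree* node;
    GenTree* GetNode() const { return node; }
};

class OperandStorage
{
    static const unsigned MaxInlineOps = 3;

    union {
        Use  m_inlineUses[MaxInlineOps]; // up to 3 operands live in the node
        Use* m_uses;                     // 4+ operands in a separate allocation
    };
    uint16_t m_numOps;

    Use* GetUses()
    {
        return (m_numOps <= MaxInlineOps) ? m_inlineUses : m_uses;
    }

public:
    OperandStorage(unsigned numOps, Use* externalUses = nullptr)
        : m_numOps(static_cast<uint16_t>(numOps))
    {
        if (numOps > MaxInlineOps)
        {
            m_uses = externalUses; // caller allocates when operands don't fit inline
        }
    }

    unsigned GetNumOps() const { return m_numOps; }

    GenTree* GetOp(unsigned index)
    {
        assert(index < m_numOps); // asserts instead of returning null
        return GetUses()[index].GetNode();
    }

    void SetOp(unsigned index, GenTree* node)
    {
        assert(index < m_numOps);
        GetUses()[index].node = node;
    }

    // Range-based for loop support over the uses.
    Use* begin() { return GetUses(); }
    Use* end() { return GetUses() + m_numOps; }
};
```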

So:

  • Having to use a separate array is a bit unfortunate for just 4 operands. It might be interesting to relax node sizing and allow node sizes between TREE_NODE_SZ_SMALL and TREE_NODE_SZ_LARGE. Not sure how feasible that is, nor how far we can go with it - if we want to allow a Create intrinsic with 32 operands, do we really want to place all 32 inside the node?
  • Operands are accessed by index. 0-based index. So it's GetOp(0) & GetOp(1) instead of gtGetOp1() & gtGetOp2(). That may be a bit confusing at first. I suppose I could try to use 1-based indices, but I'm not convinced that doing so is without drawbacks.
  • GenTreeSIMD & GenTreeHWIntrinsic are no longer GenTreeOp. That means that attempts at using gtGetOp1() & gtGetOp2() will fail.
  • GetOp(1) asserts rather than returning null like gtGetOp2 does when the node is unary. I think that's the right way to do this, but unfortunately the current HW intrinsic codegen is poorly structured - code is grouped by ISA rather than by intrinsic shape/arity, and that complicates things. SIMD codegen is better in this regard, with its genSIMDIntrinsicUnOp and genSIMDIntrinsicBinOp.

Other issues worth mentioning:
Unlike PHI and other previous LIST users, SIMD and HWINTRINSIC do support GTF_REVERSE_OPS. This needs to continue to work in the new operand representation, so some logic needs to be copied. Not a big deal, and maybe it's for the best - gtSetEvalOrder support for intrinsics is minimal anyway.

Biggest remaining issue:
All the operand logic is duplicated in GenTreeSIMD and GenTreeHWIntrinsic. It really should be common (in the GenTreeJitIntrinsic base class), but I can't put it there because they use different intrinsic enumerations, and alignment holes prevent placing some data members in the base class and some in the derived class.
I think that the only way out of this is to put all data members in the base class but thanks to the widespread bad practice of making data members public this is more difficult than it needs to be.

Comments?

@mikedn mikedn force-pushed the simd-no-list branch 2 times, most recently from 958fa15 to 2c1feee Compare December 27, 2019 15:56
@CarolEidt

General Comments (will do more detailed review next):

It might be interesting to relax node sizing and allow node size between TREE_NODE_SZ_SMALL and TREE_NODE_SZ_LARGE.

I'm not sure how much simplification this would buy us, and would certainly cost some non-trivial implementation work, if not additional complexity.

On the issue of 0-based vs 1-based indexing, I would strongly favor maintaining 0-based indexing. I'm sure that I would be more confused by having to remember that the indexed form is 1-based to match the operand names. But I'd be interested in others' opinions on this.

GenTreeSIMD & GenTreeHWIntrinsic are no longer GenTreeOp. That means that attempts at using gtGetOp1() & gtGetOp2() will fail.

The only thing that's unfortunate about this is more conceptual than practical. That is, one would like to consider (most of?) these to have functional operator semantics, which to me seems to be implied by GenTreeOp; but since that is already more of a structural characteristic than a semantic one, I'm not sure it's even conceptually useful.

On the surface, the idea of sharing more between GenTreeSIMD and GenTreeHWIntrinsic is appealing, but I've not considered it in great depth.

I think that the only way out of this is to put all data members in the base class but thanks to the widespread bad practice of making data members public this is more difficult than it needs to be.

Agreed. I'm sure I'm one of the guilty parties in getting us here.

@mikedn mikedn commented Dec 27, 2019

The only thing that's unfortunate about this is more conceptual than practical. That is, one would like to consider (most of?) these to have functional operator semantics, which to me seems to be implied by GenTreeOp but since that is already more of a structural characteristic than a semantic one I'm not sure it's even conceptually useful.

I think it would have been useful to represent true unary/binary intrinsics using GenTreeUnOp/GenTreeOp, but the only reasonable way to do that seems to be to add GT_SIMD_UNARY/GT_SIMD_BINARY to avoid the current weirdness - hey, this is a GTK_BINOP but the second operand is actually null and the first is a list of 3 operands!?! And we still need to deal with intrinsics with 3 or more operands somehow. I pondered this for a while, but for some reason attempting to throw GT_SIMD_BINARY into the mix seems like a far crazier change than the current one.

What's there to lose by not using GenTreeOp/GTK_BINOP? The obvious problem is that we need to duplicate the reverse ops logic. But I don't think it's a big deal, mainly because this logic isn't very good even today - for many 3-operand HWINTRINSICs the last operand is a constant, so for Sethi–Ullman numbering they're really 2-operand nodes. But the current implementation basically treats these as unary and is unable to reverse the first 2 operands if needed.

@mikedn mikedn commented Dec 27, 2019

Agreed. I'm sure I'm one of the guilty parties in getting us here.

Ha ha, everyone is, me included :). Sometimes it's difficult to figure out when to move away from existing code base practices. Doing so can make new code better at the cost of becoming inconsistent with the old code.

@CarolEidt CarolEidt left a comment

Overall the direction looks good to me, with some comments, suggestions and questions.

return m_numOps == 3;
}

GenTree* GetOp(unsigned index) const
@CarolEidt

In order to reduce confusion it might be good to give this a different name, such as GetIndexedOp or GetOpAtIndex; even though they're verbose, it would at least reduce confusion with GetOp1 and GetOp2.

@mikedn

Makes sense but I'm a bit concerned about the longer names as these are commonly used functions.

@CarolEidt

Right, I get that - I'd be interested in others' thoughts on this. @dotnet/jit-contrib

Resolved review threads:
  • src/coreclr/src/jit/codegencommon.cpp (outdated)
  • src/coreclr/src/jit/gentree.cpp (outdated)
  • src/coreclr/src/jit/lsraxarch.cpp (outdated)
  • src/coreclr/src/jit/rationalize.cpp
@CarolEidt

What's there to lose by not using GenTreeOp/GTK_BINOP?

I agree; there's not really a lot of actual value there.

All the operand logic is duplicated in GenTreeSIMD and GenTreeHWIntrinsic. It really should be common (in GenTreeJitIntrinsic base class) but I can't put it there due to the fact that they use different intrinsic enumerations and alignment holes prevent placing some data members in the base class and some in the derived class.

The fact that the SIMD and HWIntrinsic enums are distinct seems like something that could be distinguished based on opcode. That is, if I have a GT_Intrinsic then it uses CorInfoIntrinsics, if it is GT_SIMD then it uses SIMDIntrinsicID and if it's GT_HWINTRINSIC then it uses NamedIntrinsic (though IMO there should be a different enum for the HWIntrinsics).

@mikedn mikedn commented Dec 28, 2019

The fact that the SIMD and HWIntrinsic enums are distinct seems like something that could be distinguished based on opcode

Well, yes, this problem is solvable:

struct GenTreeJitIntrinsic {
private:
    union {
        Use  m_inlineUses[3];
        Use* m_uses;
    };
    uint16_t m_numOps;
protected:
    uint16_t m_intrinsic;
    uint32_t m_unused;
};

struct GenTreeSIMD : public GenTreeJitIntrinsic {
    SIMDIntrinsicID GetIntrinsic() const {
        return static_cast<SIMDIntrinsicID>(m_intrinsic);
    }

    unsigned GetSIMDSize() const {
        return (m_unused & 0xFFFF);
    }
    
    var_types GetSIMDBaseType() const {
        return static_cast<var_types>((m_unused >> 16) & 0xFF);
    }
};

and similar GenTreeHWIntrinsic class but with NamedIntrinsic instead of SIMDIntrinsicID and an extra GetIndexBaseType function.

The only problem is that I need to change a couple hundred more lines to replace gtSIMDIntrinsicID & co. with GetIntrinsic() & co. But if that avoids a bunch of duplication in the many IR traversal functions that only care about operands and not the intrinsic, then this might be a win for the size of the change.

@mikedn mikedn commented Dec 28, 2019

That is, if I have a GT_Intrinsic then it uses CorInfoIntrinsics

Speaking of GT_INTRINSIC - another potential problem with this change is that it uses up all the available space in a small node. At best we can steal some bits from gtSIMDSize, gtSIMDIntrinsicID and even gtSIMDBaseType, but there's no room left for a method handle like the one GenTreeIntrinsic has. Or we could limit the "inline" uses to 2 instead of 3, but that doesn't look so great in the case of HWINTRINSIC, where ternary operations are somewhat common.

Also, I don't like the current intrinsic situation very much:

  • There are 3 different kinds of intrinsics.
  • The split would probably make sense if it was done along characteristics such as SIMD vs. scalar, but that's not the case. SIMDIntrinsicAdd and NI_SSE2_Add represent the exact same operation in different ways. This already results in a ton of code duplication (import, lowering, LSRA, codegen) and the problem will get worse if we try to implement some SIMD optimizations (e.g. folding, CSE etc.).
  • General purpose scalar operations are represented as intrinsics for no good reason, only because of the "if all you have is a hammer, everything looks like a nail" approach. Trivial/common operations such as ANDN or POPCNT should probably be normal genTreeOps so they can easily participate in existing optimizations such as constant folding.

I've no idea if and when this situation could be improved. In the meantime we should ensure we're not making it worse somehow.


for (GenTreeSIMD::Use& use : tree->AsSIMD()->Uses())
{
    level = max(level, gtSetEvalOrder(use.GetNode()));
@mikedn

Need to double-check this - it's not equivalent to the old gtSetListOrder code. On the other hand, it's not clear whether what gtSetListOrder did makes sense for SIMDIntrinsicInitN. Also, looking more closely at the way SIMDIntrinsicInitN and gtSetEvalOrder work, things are quite messy:

  • SIMDIntrinsicInitN has horrible register requirements - up to 5 registers - because it needs to first evaluate all operands and then it stitches everything together. This approach will never fly if we try to make intrinsics for Create methods like Vector128<byte> Create(byte e0, byte e1, byte e2, byte e3, byte e4, byte e5, byte e6, byte e7, byte e8, byte e9, byte e10, byte e11, byte e12, byte e13, byte e14, byte e15)...
  • gtSetEvalOrder runs before other optimizations so it has no idea that some trees may turn into constants or variables thus significantly reducing register requirements. This includes SIMDIntrinsicInitN becoming a pseudo-constant in lowering...
  • We can't control the evaluation order for nodes with more than 2 operands unless we add more information to GenTreeSIMD. And we're more or less out of space for any new information (e.g. an array of integers that would indicate the evaluation order of every operand).

@mikedn

I tweaked "level" and costs a bit but "level" is still different from what gtSetListOrder does. But what gtSetListOrder does is dubious anyway:

if (lvl < 1)
{
    level = nxtlvl;
}
else if (lvl == nxtlvl)
{
    level = lvl + 1;
}
else
{
    level = lvl;
}
The Sethi–Ullman number is basically (l1 == l2) ? (l1 + 1) : max(l1, l2), and there's no trace of max anywhere in gtSetListOrder. I suspect max was supposed to be the result of setting GTF_REVERSE_OPS and swapping levels:
if (list->gtFlags & GTF_REVERSE_OPS)
{
    unsigned tmpl;
    tmpl   = lvl;
    lvl    = nxtlvl;
    nxtlvl = tmpl;
}
Except that GTF_REVERSE_OPS is never set on list nodes.
Whatever. It doesn't really matter, and gtSetEvalOrder is a mess anyway.
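
For reference, the classic Sethi–Ullman combination - (l1 == l2) ? l1 + 1 : max(l1, l2) - generalizes to n operands by evaluating the neediest operand first. A hedged standalone sketch (hypothetical names, not JIT code):

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

unsigned CombineLevels(std::vector<unsigned> levels)
{
    // Evaluate the highest-level operand first so its registers free up
    // before the cheaper operands are computed.
    std::sort(levels.begin(), levels.end(), std::greater<unsigned>());

    unsigned level = 0;
    for (size_t i = 0; i < levels.size(); i++)
    {
        // Operand i is evaluated while i earlier results are still live,
        // so it effectively needs levels[i] + i registers.
        level = std::max(level, levels[i] + static_cast<unsigned>(i));
    }
    return level;
}
```

For two operands this reduces to the familiar binary rule: equal levels give l + 1, unequal levels give the max.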

@mikedn mikedn commented Dec 30, 2019

I think that for now I'm going to ignore the SIMD/HWINTRINSIC duplication. It is primarily a pre-existing issue and the amount of new duplicated code is not that great, especially when compared with all the import/lower/lsra/codegen SIMD/HWINTRINSIC code.

Instead I'm going to work on removing the far worse pre-existing duplication produced by the bajillion custom tree traversals that exist today. With LIST gone it's easier to clean this up because I no longer have to deal with LIST related inconsistencies - included or not included in traversal. For example I included 2 commits that show how fgSetTreeSeq and fgGetFirstNode can be rewritten to use existing traversal machinery, getting rid of custom GTF_REVERSE_OPS handling in the process.

@CarolEidt

I think that for now I'm going to ignore the SIMD/HWINTRINSIC duplication.

Seems reasonable for now.

I included 2 commits that show how fgSetTreeSeq and fgGetFirstNode can be rewritten to use existing traversal machinery, getting rid of custom GTF_REVERSE_OPS handling in the process.

Those look quite promising, though when you're ready for final review it would be nice to limit this PR to a more minimal set of changes.

@sandreenko sandreenko commented Jan 7, 2020

GenTreeSIMD & GenTreeHWIntrinsic are no longer GenTreeOp. That means that attempts at using gtGetOp1() & gtGetOp2() will fail.

Speaking of GT_INTRINSIC - another potential problem with this change is that it uses up all the available space in a small node. At best we can steal some bits from gtSIMDSize, gtSIMDIntrinsicID and even gtSIMDBaseType, but there's no room left for a method handle like the one GenTreeIntrinsic has. Or we could limit the "inline" uses to 2 instead of 3, but that doesn't look so great in the case of HWINTRINSIC, where ternary operations are somewhat common.

Given that GenTreeHWIntrinsic is no longer GenTreeOp, can't we forbid bashing and changing these nodes to other tree opers completely? That would allow us to allocate the exact size for each HWIntrinsic node, and we would be able to use a template parameter for the number of arguments, up to 32.

Having to use a separate array is a bit unfortunate for just 4 operands. It might be interesting to relax node sizing and allow node size between TREE_NODE_SZ_SMALL and TREE_NODE_SZ_LARGE. Not sure how feasible is that and not sure how far we can go with it - if we want to allow a Create intrinsic with 32 operands do we really want to place all 32 inside the node?

I am not sure why the new sizes will be between SMALL and LARGE, can they be smaller than SMALL and larger than LARGE?

@mikedn mikedn commented Jan 7, 2020

Given that GenTreeHWIntrinsic is no longer GenTreeOp, can't we forbid bashing and changing these nodes to other tree opers completely? That would allow us to allocate the exact size for each HWIntrinsic node, and we would be able to use a template parameter for the number of arguments, up to 32.

I suppose we could, but I'm not sure we want to go to such an extreme. Currently SIMD/HWINTRINSIC trees are not optimized, but perhaps in the future we'll want to do something about that. At a minimum, we should be able to convert such nodes to LCL_VAR and a hypothetical CNS_SIMD to support CSE and constant folding.

I am not sure why the new sizes will be between SMALL and LARGE, can they be smaller than SMALL and larger than LARGE?

Hmm, I don't think there's any real reason to limit the node size to LARGE, not sure what I was thinking. I think all nodes should be at least SMALL size so we can convert anything to LCL_VAR/CNS_XYZ as mentioned above.

@sandreenko

At a minimum, we should be able to convert such nodes to LCL_VAR and a hypotetical CNS_SIMD to support CSE and constant folding.

We can always do a replacement instead of bashing.

@mikedn mikedn commented Jan 7, 2020

We always can do a replacement instead of bashing.

It depends. The JIT IR wasn't really designed for that; there are cases where you'd need to use gtGetParent/TryGetUse to perform such replacements, which isn't ideal. Bashing avoids that.

@mikedn mikedn closed this Jan 8, 2020
@mikedn mikedn deleted the simd-no-list branch August 30, 2020 08:10
@ghost ghost locked as resolved and limited conversation to collaborators Dec 11, 2020