[WIP] Stop using LIST nodes for SIMD operand lists #1141
Conversation
Force-pushed from 782b105 to 7c30d0d
@CarolEidt @sandreenko @echesakovMSFT This completes the removal of LIST nodes. We have space for 3 operands inside the SIMD node itself; that's good because it covers 99% of the intrinsic needs. For 4 operands or more, an out-of-line array of uses is allocated:

```cpp
union {
    Use  m_inlineUses[3];
    Use* m_uses;
};
```

For some HW intrinsics the number of operands isn't fixed, and in the current implementation it is computed by counting the number of nodes in the list. That doesn't work in the array implementation, so we also need to store the number of operands in the node:

```cpp
uint16_t m_numOps;
```

With these, the operands can be accessed using the following API:

```cpp
unsigned GetNumOps();
GenTree* GetOp(unsigned index);
void SetOp(unsigned index, GenTree* node);
UseArray Uses(); // range-based for loop support
Use& GetUse(unsigned index);
```

Other issues worth mentioning:

Biggest remaining issue:

Comments?
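A minimal standalone sketch of the storage scheme described above (the `SimdNode` name and the simplified `Use`/`GenTree` types are illustrative stand-ins, not the actual JIT code):

```cpp
#include <cassert>
#include <cstdint>

struct GenTree { int id; };           // stand-in for the real JIT node type

struct Use
{
    GenTree* node;
    GenTree* GetNode() const { return node; }
};

class SimdNode                        // hypothetical simplified node
{
    static const unsigned InlineCount = 3;

    union {
        Use  m_inlineUses[InlineCount]; // covers the common 1-3 operand case
        Use* m_uses;                    // out-of-line array for 4+ operands
    };
    uint16_t m_numOps;

    Use* UsesPtr() { return (m_numOps <= InlineCount) ? m_inlineUses : m_uses; }

public:
    explicit SimdNode(unsigned numOps) : m_numOps(static_cast<uint16_t>(numOps))
    {
        if (numOps > InlineCount)
        {
            m_uses = new Use[numOps](); // the JIT would use its own allocator
        }
    }

    unsigned GetNumOps() const { return m_numOps; }

    Use& GetUse(unsigned index)
    {
        assert(index < m_numOps);
        return UsesPtr()[index];
    }

    GenTree* GetOp(unsigned index) { return GetUse(index).node; }
    void SetOp(unsigned index, GenTree* node) { GetUse(index).node = node; }
};
```

Because `m_numOps` is stored in the node, accessors can pick the inline or out-of-line array without any list traversal.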
Force-pushed from 958fa15 to 2c1feee
General Comments (will do more detailed review next):
I'm not sure how much simplification this would buy us, and it would certainly cost some non-trivial implementation work, if not add complexity. On the issue of 0-based vs 1-based indexing, I would strongly favor maintaining 0-based indexing. I'm sure that I would be more confused by having to remember that the indexed form is 1-based to match the operand names. But I'd be interested in others' opinions on this.
The only thing that's unfortunate about this is more conceptual than practical. That is, one would like to consider (most of?) these to have functional operator semantics, which to me seems to be implied by On the surface, the idea of sharing more between
Agreed. I'm sure I'm one of the guilty parties in getting us here.
I think it would have been useful to represent true unary/binary intrinsics using What's there to lose by not using
Ha ha, everyone is, me included :). Sometimes it's difficult to figure out when to move away from existing code-base practices. Doing so can make new code better at the cost of becoming inconsistent with the old code.
Overall the direction looks good to me, with some comments, suggestions and questions.
```cpp
    return m_numOps == 3;
}

GenTree* GetOp(unsigned index) const
```
In order to reduce confusion it might be good to give this a different name, such as `GetIndexedOp` or `GetOpAtIndex`; even though they're verbose, it would at least reduce confusion with `GetOp1` and `GetOp2`.
Makes sense but I'm a bit concerned about the longer names as these are commonly used functions.
Right, I get that - I'd be interested in others' thoughts on this. @dotnet/jit-contrib
I agree; there's not really a lot of actual value there.
The fact that the SIMD and HWIntrinsic enums are distinct seems like something that could be distinguished based on opcode. That is, if I have a
Well, yes, this problem is solvable:

```cpp
struct GenTreeJitIntrinsic {
private:
    union {
        Use  m_inlineUses[3];
        Use* m_uses;
    };
    uint16_t m_numOps;
protected:
    uint16_t m_intrinsic;
    uint32_t m_unused;
};

struct GenTreeSIMD : public GenTreeJitIntrinsic {
    SIMDIntrinsicID GetIntrinsic() const {
        return static_cast<SIMDIntrinsicID>(m_intrinsic);
    }
    unsigned GetSIMDSize() const {
        return (m_unused & 0xFFFF);
    }
    var_types GetSIMDBaseType() const {
        return static_cast<var_types>((m_unused >> 16) & 0xFF);
    }
};
```

and similar. The only problem is that I need to change a couple hundred more lines to replace
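The bit-packing in those accessors can be exercised standalone; a rough sketch (the `PackedSimdInfo` name and the `var_types` values are made up for illustration and are not the real JIT definitions):

```cpp
#include <cassert>
#include <cstdint>

enum var_types : uint8_t { TYP_INT = 6, TYP_FLOAT = 11 }; // example values only

// Hypothetical holder mirroring the m_unused layout sketched above:
// low 16 bits = SIMD size in bytes, next 8 bits = base type.
class PackedSimdInfo
{
    uint32_t m_packed = 0;

public:
    void Set(unsigned simdSize, var_types baseType)
    {
        assert(simdSize <= 0xFFFF);
        m_packed = (simdSize & 0xFFFF) | (static_cast<uint32_t>(baseType) << 16);
    }

    unsigned GetSIMDSize() const { return m_packed & 0xFFFF; }

    var_types GetSIMDBaseType() const
    {
        return static_cast<var_types>((m_packed >> 16) & 0xFF);
    }
};
```

Packing both values into one 32-bit field is what lets the derived node reuse space the base class would otherwise waste on padding.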
Speaking of which, I don't like the current intrinsic situation very much:

I've no idea if and when this situation could be improved. In the meantime we should ensure we're not making it worse somehow.
```cpp
for (GenTreeSIMD::Use& use : tree->AsSIMD()->Uses())
{
    level = max(level, gtSetEvalOrder(use.GetNode()));
```
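The `Uses()` call in the loop above presumably returns a small range type; a minimal sketch of what makes the range-based `for` work (the names are assumptions modeled on the API described earlier in the thread, with simplified stand-in types):

```cpp
#include <cassert>

struct GenTree { int id; };
struct Use
{
    GenTree* node;
    GenTree* GetNode() const { return node; }
};

// begin()/end() member functions are all a range-based for loop needs.
class UseArray
{
    Use*     m_uses;
    unsigned m_count;

public:
    UseArray(Use* uses, unsigned count) : m_uses(uses), m_count(count) {}
    Use* begin() const { return m_uses; }
    Use* end() const { return m_uses + m_count; }
};
```

With this, `for (Use& use : node->Uses())` visits each operand directly, with no list nodes to allocate or traverse.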
Need to double check this; it's not equivalent to the old `gtSetListOrder` code. On the other hand, it's not clear if what `gtSetListOrder` did makes sense for `SIMDIntrinsicInitN`. Also, when looking more closely at the way `SIMDIntrinsicInitN` and `gtSetEvalOrder` work, it seems that things are quite messy:

- `SIMDIntrinsicInitN` has horrible register requirements - up to 5 registers - because it needs to first evaluate all operands and then stitch everything together. This approach will never fly if we try to make intrinsics for `Create` methods like `Vector128<byte> Create(byte e0, byte e1, byte e2, byte e3, byte e4, byte e5, byte e6, byte e7, byte e8, byte e9, byte e10, byte e11, byte e12, byte e13, byte e14, byte e15)`.
- `gtSetEvalOrder` runs before other optimizations, so it has no idea that some trees may turn into constants or variables, thus significantly reducing register requirements. This includes `SIMDIntrinsicInitN` becoming a pseudo-constant in lowering.
- We can't control the evaluation order for nodes with more than 2 operands unless we add more information to `GenTreeSIMD`. And we're more or less out of space for any new information (e.g. an array of integers that would indicate the execution order of every operand).
I tweaked "level" and costs a bit, but "level" is still different from what `gtSetListOrder` does. But what `gtSetListOrder` does is dubious anyway:

runtime/src/coreclr/src/jit/gentree.cpp, lines 2522 to 2533 in c1a51e2:

```cpp
if (lvl < 1)
{
    level = nxtlvl;
}
else if (lvl == nxtlvl)
{
    level = lvl + 1;
}
else
{
    level = lvl;
}
```

One would expect `(l1 == l2) ? (l1 + 1) : max(l1, l2)`, and there's no trace of `max` anywhere in `gtSetListOrder`. I suspect `max` was supposed to be the result of setting `GTF_REVERSE_OPS` and swapping levels:

runtime/src/coreclr/src/jit/gentree.cpp, lines 2501 to 2508 in c1a51e2:

```cpp
if (list->gtFlags & GTF_REVERSE_OPS)
{
    unsigned tmpl;

    tmpl   = lvl;
    lvl    = nxtlvl;
    nxtlvl = tmpl;
}
```

but `GTF_REVERSE_OPS` is never set on list nodes. Whatever; it doesn't really matter, and `gtSetEvalOrder` is a mess anyway.
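To make the discrepancy concrete, here is a side-by-side sketch of the Sethi-Ullman-style formula mentioned above versus what the quoted `gtSetListOrder` snippet computes (the function names are mine, for illustration only):

```cpp
#include <algorithm>
#include <cassert>

// Sethi-Ullman style estimate: evaluating two subtrees of equal
// "level" needs one extra register; otherwise the larger level wins.
unsigned SethiUllmanLevel(unsigned l1, unsigned l2)
{
    return (l1 == l2) ? (l1 + 1) : std::max(l1, l2);
}

// What the quoted gtSetListOrder snippet actually computes (lvl is the
// list element's level, nxtlvl the rest of the list). Note the missing
// max: when lvl >= 1 and lvl < nxtlvl, the larger nxtlvl is discarded.
unsigned ListOrderLevel(unsigned lvl, unsigned nxtlvl)
{
    if (lvl < 1)
    {
        return nxtlvl;
    }
    else if (lvl == nxtlvl)
    {
        return lvl + 1;
    }
    else
    {
        return lvl;
    }
}
```

The two agree on the equal-level and `lvl > nxtlvl` cases but diverge whenever `1 <= lvl < nxtlvl`, which is where the missing `max` shows.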
I think that for now I'm going to ignore this. Instead, I'm going to work on removing the far worse pre-existing duplication produced by the bajillion custom tree traversals that exist today. With
Seems reasonable for now.
Those look quite promising, though when you're ready for final review it would be nice to limit this PR to a more minimal set of changes.
Given that `GenTreeHWIntrinsic` is no longer a `GenTreeOp`, can't we forbid bashing and changing these nodes to other tree operations completely? That would allow us to allocate the exact size for each HWIntrinsic node, and we would be able to make the number of arguments a template parameter, up to 32.
I am not sure why the new sizes will be between SMALL and LARGE; can they be smaller than SMALL or larger than LARGE?
I suppose we could, but I'm not sure we want to go to such an extreme. Currently SIMD/HWINTRINSIC trees are not optimized, but perhaps in the future we want to do something about that. At a minimum, we should be able to convert such nodes to
Hmm, I don't think there's any real reason to limit the node size to LARGE; not sure what I was thinking. I think all nodes should be at least SMALL size so we can convert anything to
We can always do a replacement instead of bashing.
It depends. The JIT IR wasn't really designed for that; there are cases where you'll need to use
Contributes to https://github.com/dotnet/coreclr/issues/19876