Improve/cleanup RyuJIT's IR handling of operand lists #11058

Closed · 3 of 6 tasks
mikedn opened this issue Sep 8, 2018 · 26 comments

Labels: area-CodeGen-coreclr, enhancement, JitUntriaged
Milestone: Future

mikedn (Contributor) commented Sep 8, 2018

Some IR nodes require a variable number of operands and the current implementation uses linked lists of GT_LIST (or similar) nodes.

IR nodes known to use such lists:

  • GT_CALL - has a dedicated this argument operand and a list of operands for all other arguments
  • GT_PHI - this is a GTK_UNOP node and its sole operand is a list
  • GT_HWINTRINSIC - this is a GTK_BINOP node, but if it needs 3 or more operands the first operand is a list and the second is null
  • GT_FIELD_LIST - a GTK_BINOP that acts both as an actual IR node and as a list node
  • Anything else?

From the above only GT_CALL is reasonable, though it could probably be improved as well. The rest suffer from various issues:

  • GT_PHI - this looks similar to GT_CALL on the surface but it lacks the special handling calls get throughout the JIT. Its list nodes end up in the linear order and they are visited by tree walking facilities like GenTreeVisitor. This is nonsense - these list nodes are an internal implementation detail of the GT_PHI node and they should not appear anywhere. They have no type, no VN, no nothing. They simply do not exist as far as the IR is concerned.
  • GT_HWINTRINSIC suffers from the same problem as GT_PHI. It just makes it worse by sometimes using lists and sometimes not. And unlike GT_PHI, it lacks special handling in gtDispTree and dumps its list nodes. (See the sketch after this list.)
  • GT_FIELD_LIST - IMO it would be better to have a GT_PACK node with a list of GT_FIELD_LIST nodes.
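To make the encoding concrete, here is a minimal sketch of how a 3-operand GT_HWINTRINSIC ends up shaped today. The types are hypothetical stand-ins, not the real JIT headers; only the op1-is-a-list/op2-is-null shape is taken from the description above.

enum genTreeOps { GT_LIST, GT_HWINTRINSIC, GT_LCL_VAR };

struct GenTree
{
    genTreeOps gtOper;
    GenTree*   gtOp1 = nullptr; // for GT_LIST: the current element
    GenTree*   gtOp2 = nullptr; // for GT_LIST: the rest of the chain

    GenTree(genTreeOps oper) : gtOper(oper) {}
};

GenTree* NewList(GenTree* element, GenTree* rest)
{
    GenTree* list = new GenTree(GT_LIST);
    list->gtOp1 = element;
    list->gtOp2 = rest;
    return list;
}

int main()
{
    GenTree* a = new GenTree(GT_LCL_VAR);
    GenTree* b = new GenTree(GT_LCL_VAR);
    GenTree* c = new GenTree(GT_LCL_VAR);

    // HWINTRINSIC(op1: LIST(a, LIST(b, LIST(c, null))), op2: null) -
    // a "binop" whose second operand is always null. A naive gtOp1/gtOp2
    // walk visits the GT_LIST nodes themselves, not just a, b and c.
    GenTree* intrinsic = new GenTree(GT_HWINTRINSIC);
    intrinsic->gtOp1   = NewList(a, NewList(b, NewList(c, nullptr)));
    intrinsic->gtOp2   = nullptr;
    return 0;
}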

category:implementation
theme:ir
skill-level:expert
cost:extra-large

@mikedn
Copy link
Contributor Author

mikedn commented Sep 8, 2018

At a minimum, we should treat such lists consistently - if a call's list nodes aren't visited by GenTreeVisitor & co. then the same should happen for PHIs and others. Same for adding list nodes to the linear order.

And once you stop traversing list nodes the next question arises: do they need to be GenTree nodes at all? A simple list node would have 2 pointers - next and op - 16 bytes. GenTreeArgList has 80 bytes. So every PHI argument needs 80 * 2 bytes. Hrm.
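As a rough sketch of the size difference (the PhiArgUse type here is hypothetical; the 80-byte figure for GenTreeArgList is the one quoted above):

#include <cstdio>

struct GenTree; // opaque here; a real GenTree carries oper, type, flags, VN, ...

// The "simple list node": two pointers, 16 bytes on a 64-bit platform.
struct PhiArgUse
{
    GenTree*   op;   // the PHI argument itself
    PhiArgUse* next; // next use in the list
};

int main()
{
    // Prints 16 on a typical 64-bit platform, versus ~80 bytes for a
    // GenTreeArgList node, which drags in the whole GenTree header.
    printf("sizeof(PhiArgUse) = %zu\n", sizeof(PhiArgUse));
    return 0;
}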

mikedn (Contributor, Author) commented Sep 9, 2018

From the above only GT_CALL is reasonable

Actually it is not. fgSetTreeSeqHelper does add the GT_LIST nodes used by GT_CALL to the linear order. It's GenTreeVisitor that does not visit them.

mikedn (Contributor, Author) commented Sep 12, 2018

And once you stop traversing list nodes the next question arises: do they need to be GenTree nodes at all? A simple list node would have 2 pointers - next and op - 16 bytes. GenTreeArgList has 80 bytes. So every PHI argument needs 80 * 2 bytes. Hrm.

Tried to change PHI to use its own linked list nodes instead of GT_LIST. Works fine, there's simply no need to use GT_LIST and waste both memory and time (to link list nodes in linear order).

Same for GT_FIELD_LIST. The only weird thing about it is that, since the field list head is also an IR node, its type is the same as the first field's type. You'd expect the IR node's type to be struct.

It's very likely that calls also do not need to use GT_LIST nodes. There are a few list dependencies here and there but nothing that seems strictly necessary. Quite the contrary in fact.

HW intrinsics are a bit of a problem. Since their op1 is either a list or a real node it's more difficult to avoid using GT_LIST. But if it turns out that HW intrinsics are the only nodes that truly need GT_LIST then there must be a way to change that.
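The "either a list or a real node" ambiguity is what forces special cases into every consumer. Here is a sketch of the kind of accessor this implies, reusing the stand-in GenTree model from the earlier sketch (the helper name is hypothetical):

#include <cassert>

// Uses the stand-in GenTree/genTreeOps definitions from the earlier sketch.
GenTree* GetHWIntrinsicOperand(GenTree* node, unsigned index)
{
    assert(node->gtOper == GT_HWINTRINSIC);

    if ((node->gtOp1 == nullptr) || (node->gtOp1->gtOper != GT_LIST))
    {
        // 0-2 operands: stored directly in gtOp1/gtOp2.
        return (index == 0) ? node->gtOp1 : node->gtOp2;
    }

    // 3+ operands: op2 is null and op1 is a GT_LIST chain; indexed
    // access degenerates into a linear walk.
    GenTree* list = node->gtOp1;
    for (unsigned i = 0; i < index; i++)
    {
        list = list->gtOp2;
    }
    return list->gtOp1;
}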

mikedn (Contributor, Author) commented Dec 23, 2018

HW intrinsics are a bit of a problem. Since their op1 is either a list or a real node it's more difficult to avoid using GT_LIST. But if it turns out that HW intrinsics are the only nodes that truly need GT_LIST then there must be a way to change that.

GT_HWINTRINSIC/GT_SIMD nodes need to be GTK_SPECIAL and have a list/array of operands. Having them be GTK_BINOP/GenTreeOp when they can have 1-32 operands doesn't help, quite the opposite. This basically subverts the meaning of GTK_BINOP and requires all IR traversal utilities to add special casing to handle the GT_LIST operand. So if it's special then let's make it special.

Unlike GT_PHI and GT_FIELD_LIST, SIMD nodes tend to require random access to operands so it's probably best to use an array rather than a linked list of uses. Up to 3 operands fit in an "inline" array; for more than 3 we'll need a separately allocated array.

The only question is what to do about GTF_REVERSE_OPS; it may be useful for it to still work on SIMD nodes having exactly 2 operands. But evaluation order is relevant for any number of operands greater than 1, so perhaps an alternative mechanism should be used. Anyway, GTF_REVERSE_OPS is very much a legacy JIT32 thing (and a pretty horrible thing).

tannergooding (Member) commented:

when they can have 1-32 operands doesn't help

I think we are in a slightly better position here now.

Previously, we had the SetVector128/256 methods that were treated as intrinsic, and they would have one operand per element. That has since changed and those methods are implemented in software (generally by calling existing intrinsics like Insert). So, I believe we now only have GT_HWINTRINSIC nodes with 1, 2, or 3 operands. Ah, and there is also the Avx2.GatherMaskVector128/256 intrinsic that takes 4 operands.

So I think (at least for the HW_INTRINSIC nodes) the only time we need more than 4 is if/when support for emitting constants (https://github.com/dotnet/coreclr/issues/17225) is added; and that would likely only be needed if we wanted to support inputs that were determined to be constant post importation (similar to https://github.com/dotnet/coreclr/issues/17108 -- but maybe someone has a good idea on how to handle this).

I don't think we can work around having 3 operands (since things like FMA have three distinct operands), but we might be able to do something about the 4-operand GatherMaskVector case since the actual instruction ends up combining 3 of the operands into a Vector SIB memory address.
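For reference, here is what that shape looks like from C++ (an illustrative equivalent, not JIT code): the IR-level node carries four value operands, while the instruction encoding folds base, index and scale into a single VSIB memory operand.

#include <immintrin.h>

// AVX2 masked gather: src, mask and the result are vector registers;
// base + vindex + scale become one VSIB memory operand in the encoding.
__m256i GatherMasked(__m256i src, const int* base, __m256i vindex, __m256i mask)
{
    return _mm256_mask_i32gather_epi32(src, base, vindex, mask, /* scale */ 4);
}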

mikedn (Contributor, Author) commented Jan 10, 2019

Ah, and there is also the Avx2.GatherMaskVector128/256 intrinsic that takes 4 operands.

Yes, and 4 is already more than I can fit in a small node.

similar to #9989 -- but maybe someone has a good idea on how to handle this

Hmm? But the constant problem is already solved in dotnet/coreclr#14393. We just need to extend that to 32-byte constants, which requires VM support for allocating 32-byte-aligned data sections. Shouldn't be a big problem I guess. And perhaps we need actual constant SIMD nodes in the IR, though the cases where that would be useful are pretty rare today.

I don't think we can work around having 3 operands (since things like FMA have three distinct operands), but we might be able to do something about the 4-operand GatherMaskVector case since the actual instruction ends up combining 3 of the operands into a Vector SIB memory address

Yes, but even if we get rid of gather there's also new Vector4(x, y, z, w), with or without constant support.

tannergooding (Member) commented:

Yes, but even if we get rid of gather there's also new Vector4(x, y, z, w), with or without constant support.

We might be able to implement that in software, similarly to how we are doing Vector128<float> Vector128.Create(x, y, z, w) for HWIntrinsics. Or, possibly handle it in the importer by doing the ISA check and replacing it with the corresponding chain of SIMD intrinsics...
For Sse41 this is just SetX, SetY, SetZ, and SetW, which emits insertps (see the sketch below). For Sse2 we would need something to represent Unpack, much as we already have a special Shuffle SIMD intrinsic (which sounds like it might make things easier than having to support 4-operand nodes for a single scenario).
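The SSE4.1 "chain of inserts" looks roughly like this in C++ intrinsics (an illustrative sketch of the expansion being described, not the JIT's actual output):

#include <immintrin.h>

// Assemble {x, y, z, w} one lane at a time with insertps (SSE4.1).
__m128 CreateVector4(float x, float y, float z, float w)
{
    __m128 v = _mm_set_ss(x);                  // lane 0 = x
    v = _mm_insert_ps(v, _mm_set_ss(y), 0x10); // lane 1 = y
    v = _mm_insert_ps(v, _mm_set_ss(z), 0x20); // lane 2 = z
    v = _mm_insert_ps(v, _mm_set_ss(w), 0x30); // lane 3 = w
    return v;
}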

We just need to extend that to 32-byte constants, which requires VM support for allocating 32-byte-aligned data sections

I think we could probably get by, at least initially, with a 16-byte-aligned data section and just using movups; although having it be properly aligned would definitely be more beneficial in the long run (depending on whether it can easily be made "pay-for-play" or not).

mikedn (Contributor, Author) commented Jan 10, 2019

We might be able to implement that in software, similarly to how we are doing Vector128<float> Vector128.Create(x, y, z, w) for HWIntrinsics

Maybe I'm missing something but what would that buy us? The software implementation always generates poor code when the arguments are all constants. In fact, they don't even have to be all constants - just enough of them that the trade-off between the cost of a memory load and the cost of assembling the vector from elements comes out in the load's favor.

I think we could probably get by, at least initially with a 16-byte aligned data section and just using movups;

Yes, I didn't sit down to count the cycles yet but it's likely that the cost of an unaligned memory load and the cost of assembling the vector from elements are similar.

tannergooding (Member) commented:

Maybe I'm missing something but what would that buy us?

I was basically suggesting that, if we were just checking for constants in the importer then we could avoid needing to create a node with 4 or more operands. We could then also deal with that for the non-constant case by importing the correct chain of intrinsics rather than having a specialized InitN intrinsic (at least for HWIntrinsics, this results in the same codegen given how fine-grained they are).

But, I see that dotnet/coreclr#14393 was doing it in lowering; so we have to have a node that can carry all the operands to support that.

mikedn (Contributor, Author) commented Jan 10, 2019

But, I see that dotnet/coreclr#14393 was doing it in lowering; so we have to have a node that can carry all the operands to support that.

Yeah, I'm pretty sure the same approach could also be used during import. But if effort is put into handling this case then why not do it in the right place, in lowering, where you have a chance to catch more cases due to the optimizer producing new constants?

Having to handle more than 3 operands is a bit of an annoyance but it's not that bad. AFAIR the main issue is that I need to store the number of operands in the node itself, because it cannot be determined solely from the intrinsic id. Another byte (or a couple of bits) I need to try to squeeze in between the rest of the data, to avoid making the node larger.

Of course, I could just use a linked list, like I'm using for PHI and FIELD_LIST. But considering that most intrinsics have fewer than 4 operands, that seems like a bad compromise.

mikedn (Contributor, Author) commented Jan 10, 2019

Looking back, I think my comment about GTF_REVERSE_OPS wasn't very accurate/clear. Even today GTF_REVERSE_OPS does not work on intrinsics with more than 2 operands: if you have more than 2 then the node itself is basically unary, so GTF_REVERSE_OPS has no meaning in that case.

BruceForstall changed the title from "Improve/cleanup RyJIT's IR handling of operand lists" to "Improve/cleanup RyuJIT's IR handling of operand lists" Sep 20, 2019
msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
msftgits added this to the Future milestone Jan 31, 2020
BruceForstall added the JitUntriaged label Oct 28, 2020

SingleAccretion (Contributor) commented:

I would like to finish this work, by deleting GT_LIST and using an array of operands for GT_SIMD and GT_HWINTRINSIC. The benefits have been enumerated in this issue already, but to summarize:

  1. No special cases in the IR for nodes that are not actual nodes, and no two interpretations of a "use edge" (this, for example, allows trivially folding GenTree::TryGetUse and GenTree::gtGetChildPointer into one function).
  2. A much more efficient representation in terms of memory usage.
  3. More efficient operand access.
  4. As it turns out, cleaner and more streamlined code.
  5. Some other, more miscellaneous things, like easy VN support for the 3+ operand cases.

I propose the following design:

struct GenTreeMultiOp : GenTree
{
    uint8_t   m_operandCount; // We have 3 spare bytes in the base GenTree, we'll use 1 to store the count.
    GenTree** m_operands;
};

struct GenTreeJitIntrinsic : GenTreeMultiOp
{
    GenTree* inlineOperands[2];
    // The rest of the data members fit into 8 bytes, making this a small node.
};

The m_operands array will either point to inlineOperands (when we are constructing a node with <= 2 operands), or to a dynamically allocated array when we need more than two. This allows the accessor function to be simply m_operands[operandIndex], without any checks of the operand count. The downside is that the 3-operand case now needs a dynamic allocation, but I have measured how many nodes need 3+ operands and came to the conclusion that the faster accessor is the better tradeoff in this case. (See the sketch below.)

I have also made the number of inline operands configurable (that is why they live in a derived class and not in GenTreeMultiOp itself), so that it can be reused for nodes with a different requirement on the size of the inline array (for example, it would be interesting to see if we can use GenTreeMultiOp for GT_PHI, which could have up to 3 inline operands). It is possible/probable that this is not needed, but it is also easy to change.
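A compilable sketch of that construction and accessor logic (simplified, hypothetical stand-ins rather than the real node classes):

#include <cassert>
#include <cstdint>

struct GenTree; // stand-in for the real node base class

struct GenTreeMultiOp
{
    uint8_t   m_operandCount;
    GenTree** m_operands; // points at inline storage or an allocated array

    GenTree*& Op(unsigned index)
    {
        // No inline-vs-allocated check needed - just an indexed load.
        assert(index < m_operandCount);
        return m_operands[index];
    }
};

struct GenTreeJitIntrinsic : GenTreeMultiOp
{
    GenTree* inlineOperands[2];

    GenTreeJitIntrinsic(unsigned operandCount, GenTree** allocatedOperands)
    {
        m_operandCount = (uint8_t)operandCount;
        // <= 2 operands fit inline; otherwise the caller supplies an array
        // (in the JIT this would come from the compiler's arena allocator).
        m_operands = (operandCount <= 2) ? inlineOperands : allocatedOperands;
    }
};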

You can see the fully functional set of changes that will be required to implement this design here. It comes with a ~0.25% reduction in the number of retired instructions when replaying the x64 benchmarks collection, which is a nice bonus.

cc @dotnet/jit-contrib, @tannergooding

tannergooding (Member) commented Sep 20, 2021

My own view here (as someone who semi-regularly contributes but doesn't work on the JIT team) is that while this might be a general improvement, it's not clear that it's the right first step, and it doesn't look to take GT_CALL into account, which is another common case for more than 2 operands.

In particular, I think part of the problem is that there are multiple ways to access and enumerate the operand nodes. That is, you can access the fields directly as gtOp1/gtOp2, you can use the helper functions gtGetOp1()/gtGetOp2(), you can enumerate them as part of the general "children" of a node, etc. How we do this throughout the JIT isn't consistent, and that causes all the real problems. IIRC, last time we measured, the throughput impact of GT_LIST was minimal compared to the more expensive parts of the JIT and so it wasn't worth focusing on immediately (the same was true of the hwintrinsic name recognition, where not having an optimized binary search barely showed up in the metrics).

Likewise, while we have GenTreeUnOp and GenTreeOp, these concepts are somewhat a lie and don't actually represent the 1 and 2 operand cases. Instead, we have various special rules about when it's 0, 1, 2, or more operands, and you somewhat need to "know" the expected way to check/handle everything. In practice, these two node kinds are largely used to separate some flags and general handling for when the node likely has 1 or 2 operands.

So I think that the right "first step" is to clean this dependence on "internals" up. That is, it shouldn't matter how we internally represent 0/1/2/3+ operands; this should be completely hidden from the general consumer. We should:

  • explicitly make gtOp1/gtOp2 private fields
  • expose some OperandCount property that indicates the number of operands tracked by the node
    • my thought is that this is the "inputs" to the operation represented, so it's everything other phases (like valuenum) need to correctly handle the node
  • expose a way to get the operands by index and enumerate through them in order (see the sketch after this list)
    • I think it's fine for there to still be named helpers for common cases like Op1/Op2 or, say, Base/Index on array accesses, etc.

Once we have this and everything is going through these APIs for accessing and dealing with operands, then we no longer have any concern about how it's implemented internally. We are free to have inline operands or linked lists or dynamically allocated arrays, and to freely swap it around without worrying about n places in the JIT getting broken. Likewise, we can actually centralize handling for things like "this has 2 operands, is it commutative" or other "general" checks that are currently guessed at based on a node being GenTreeOp, which of its operands are null, and, if not, their kinds.
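A sketch of the kind of surface being proposed (hypothetical signatures; a real version would dispatch to per-kind storage):

#include <cassert>

enum genTreeOps { GT_NEG, GT_ADD };

struct GenTree
{
    genTreeOps gtOper;

private: // consumers never touch the raw fields
    GenTree* gtOp1;
    GenTree* gtOp2;

public:
    // Number of operand ("input") edges this node has.
    unsigned OperandCount() const
    {
        switch (gtOper)
        {
            case GT_NEG: return 1;
            case GT_ADD: return 2;
            default:     return 0; // multi-op kinds would consult their own count
        }
    }

    // The i-th operand; multi-op kinds would index their operand array instead.
    GenTree* GetOperand(unsigned index) const
    {
        assert(index < OperandCount());
        return (index == 0) ? gtOp1 : gtOp2;
    }
};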

SingleAccretion (Contributor) commented:

So I think that the right "first step" is to clean this dependence on "internals" up. That is, it shouldn't matter how we internally represent 0/1/2/3+ operands; this should be completely hidden from the general consumer.

Once we have this and everything is going through these APIs for accessing and dealing with operands, then we no longer have any concern about how it's implemented internally. We are free to have inline operands or linked lists or dynamically allocated arrays, and to freely swap it around without worrying about n places in the JIT getting broken.

I have been thinking about this for the past few days, and here's what I came to.

First, what we cannot change:

  1. We have to use specialized functions for getting the operands of any particular "class", because it requires knowledge of the node's layout (trivially, you cannot gtGetOp1() a GT_CALL because it does not store its first operand the way a GenTreeUnOp does, and switching on the oper is not a solution as it is unnecessarily inefficient, since in the vast majority of cases we know the type of the node in question).
  2. We cannot make all opers use the same node layout for storing operands because, well, different opers have different needs (put another way, different Use classes - see the sketch after this list). The GenTreeOp layout is "ideal" for the most common case of a binary/unary oper; the linked lists (with large Uses perhaps recording the argument info) are appropriate for calls; GT_FIELD_LIST's Use needs a field type; a PHI's Use can be a simple GenTree* (if stored in an array, which may not necessarily be the most efficient representation when we consider the need to update PHIs). When speaking about iterating the operands, you need a case for each type of Use that is in an array (currently all those Uses are simple GenTree*s) and one more general case for a linked list (and, today, a special case for GT_CALL because it uses a mix of both approaches).
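To make the "different Use classes" point concrete, a hedged sketch (field names modeled on, but not identical to, the real JIT classes):

#include <cstdint>

struct GenTree; // opaque stand-in

enum var_types : uint8_t { TYP_INT, TYP_FLOAT, TYP_STRUCT };

// PHI: the payload is just the node itself.
struct PhiUse
{
    GenTree* m_node;
    PhiUse*  m_next;
};

// FIELD_LIST: each use also carries the field's offset and type.
struct FieldListUse
{
    GenTree*      m_node;
    FieldListUse* m_next;
    unsigned      m_offset;
    var_types     m_type;
};

// Calls: "fat" uses, with room for per-argument ABI information.
struct CallArgUse
{
    GenTree*    m_node;
    CallArgUse* m_next;
    // + register assignment, stack offset, copy/temp info, etc.
};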

Given the above, I simply do not see a way to "fully" unify the operands iteration without TP impact (trivially, getting the ith operand of a linked list requires a linear traversal).

Aside from that, GenTreeMultiOp would only help us here, as it removes one special case - a non-Use-based linked list - and can be iterated the same way a "pure" inline array can (well, not truly, because GTF_REVERSE_OPS exists, but let's say for opers that do not use it...). For example, it allows rewriting GenTreeUseEdgeIterator in a way that unifies GenTreeMultiOp and all the TriOp/QuadOp opers we have (GT_ARR_OFFSET/GT_CMPXCHG and some others). Whether that's worth it or not is not abundantly clear to me yet, considering the potential TP impact - not "real" impact, because these opers are very rare, but in the abstract "are we doing the most efficient thing here" sense (note that I say that not because I like premature optimization, but because we cannot have undue regressions in code that already works) - but it is something that will be made possible by the proposed changes.

doesn't look to take GT_CALL into account, which is another common case for more than 2 operands.

I note that I explicitly did not take calls into account because I do not think it is fruitful at this time to try and make them use arrays (for the reasons mentioned above). Calls have very different requirements from "primitive" operations like the ones the intrinsics represent (which is to say, they need "fat" Uses).

tannergooding (Member) commented:

and switching on the oper is not a solution as it is unnecessarily inefficient

I think this is worth validating. Ideally for all our scenarios the C++ methods are appropriately marked const and are inlineable in a fashion that this isn't a concern for the release binaries.

That is, having a GenTree-level method that switches on the oper kind shouldn't really be an issue; we do this in tons of places already, and solving the more general issue of ensuring correctness should be the bigger concern IMO.

Even gtGetOp2 is part of GenTree today and it simply asserts the oper kind and does AsOp()->gtOp2. I don't see anything that would prevent this, or that would prevent the compiler from inlining the check for UnOp vs BinOp vs MultiOp (or even having the "general" check be slightly more expensive while an identically named method on the specific tree kinds is more efficient).

We cannot make all opers use the same node layout for storing operands because, well, different opers have different needs (put another way, different Use classes).

Yes, but today these details of how it's stored leak at every level, and that's the main problem that I've seen. Because everything is exposed differently based on the node kind, you have to remember any time you deal with operands that GenTreeX does it this way and GenTreeY does it this other way, and that if GenTreeY.SomeOtherField == ... you need to also check for Z. It's just not tenable and causes a number of downstream complexities.


My main concern here is that GT_LIST itself, while not ideal, isn't that expensive compared to everything else the JIT does. So while this is likely goodness, I'm worried it's going to make solving the larger problem that we have with GenTreeUnOp, GenTreeOp, GenTreeJitIntrinsic, and GenTreeCall more difficult to handle, or at the very least not address it at all.

SingleAccretion (Contributor) commented:

I think this is worth validating. Ideally for all our scenarios the C++ methods are appropriately marked const and are inlineable in a fashion that this isn't a concern for the release binaries.

That is, having a GenTree-level method that switches on the oper kind shouldn't really be an issue; we do this in tons of places already, and solving the more general issue of ensuring correctness should be the bigger concern IMO.

Well, one of the problems is that "opers" represent a hierarchy that the node types are disjoint from. That is, if we are passing a GenTreeOp* to a function, the native compiler cannot be made to know (without explicit checks or fragile reliance on interprocedural analysis) that the only values the gtOper field can hold are GT_X, GT_Y, and GT_Z. But even aside from that, the indexing is just not viable for linked lists.

And of course, I note that the proposed changes simplify the intrinsic nodes in this respect: there will only be one way to access their operands (modulo iterators which exist for convenience), using Op(i) (I could've made it a gtGetOp(i), but... I just didn't 😄).

Yes, but today these details of how it's stored leak at every level, and that's the main problem that I've seen. Because everything is exposed differently based on the node kind, you have to remember any time you deal with operands that GenTreeX does it this way and GenTreeY does it this other way, and that if GenTreeY.SomeOtherField == ... you need to also check for Z. It's just not tenable and causes a number of downstream complexities.

What are the examples of this w.r.t. the operand representation? From what I've seen so far it is "ok", modulo some very special cases that should really just be deleted for good (e.g. some SIMD init with a zero drops the zero, making it implicit), or that are fixable without changing the operand layout itself (e.g. how ASG(struct, 0) assigns a TYP_INT to a TYP_STRUCT).

I'm worried it's going to make solving the larger problem that we have with GenTreeUnOp, GenTreeOp, GenTreeJitIntrinsic, and GenTreeCall more difficult to handle, or at the very least not address it at all.

But why? In other words, what becomes more difficult with the proposed changes? The way I see it, it just makes things more uniform and streamlined, and, while not overhauling the JIT, it is a step in the right direction (eliminating a class of IR nodes that are not actual nodes - the last one left is GT_ARGPLACE).

SingleAccretion (Contributor) commented:

With #59912 merged, this work has now been completed 🎉!

Remaining simplifications enabled by this change are tracked in source via TODO-List-Cleanup.

ghost locked as resolved and limited conversation to collaborators Dec 20, 2021