Pluggable Packing Representation in IREE #12075
Replies: 3 comments 5 replies
-
Looking through this, this is well in line with everything that is already being considered for implementing data-tiling in IREE (please see details below). I think maybe some of the implementation details today are being over-emphasized. I think everything in here is already pluggable in IREE and was always the plan.
That is just what is used today cause the 2D data-tiling on
This all seems like things that should happen in the Codegen backend, and is already what is being done with
The current use the enum encoding attributes are a placeholder. Literally the simplest thing. In terms of complexity, having a single attribute that has a bunch of optional fields does not seem that different from having any different number of Enums. I think the only easily extensible state here is through use of AttributeInterfaces. I am not opposed to either anyway.
This is just the current implementation and will be fixed soon. IREE is still using the
I dont follow the details here. The
I think we should ignore this part. The whole point of the |
Beta Was this translation helpful? Give feedback.
-
Thanks for writing this up. I am excited to see where this is going! What @nicolasvasilache described is what we are doing. First, we have a set of "good" packing decisions for different operations like matmul or conv2d, currently hard-coded. Then, after packing, we try to push |
Beta Was this translation helpful? Give feedback.
-
Hi Nicolas,
Great write up! Some comments inline.
On Mon, 6 Feb 2023 at 17:21, Nicolas Vasilache ***@***.***> wrote:
Step 3. and 4. (delayed packing materialization and dispatch region
creation) are a cornerstone of IREE.
They aim to provide a good enough fixed graph partitioning for all target
hardware (CPU, GPU, mobile, all current and future accelerators).
As a consequence, steps 3. and 4. should never know anything about the
underlying hardware.
This may be the case in IREE, but I disagree this is a strong general
property.
While your high-level pass does annotate the ops, not knowing what is the
underlying hardware limits the ability to do high-level transforms that
would be beneficial to code generation down the line.
High-level decisions, for example how to partition your graph (around
propagation), will make following code-gen decisions from easy to
impossible. This is less important when lowering to CPUs, because it's the
same ISA and (mostly) same virtual memory, but when lowering to GPUs or
accelerators, some compute still remain on CPUs, and knowing the trade-offs
at partition time will make you find better boundaries and lead to better
code generation of the partitions down the pipe.
The alternative is to carry dozens of attributes in each instruction. This
is not only fragile, but can lead to inconsistent messages that the
back-end has no option but to drop it on the floor and move on.
Sometimes, the LLVM compiler itself may get in the way
<https://discourse.llvm.org/t/understanding-and-controlling-some-of-the-avx-shuffle-emission-paths/59237>
and adding inline asm may be necessary for unlocking performance (see table
3 p25 <https://arxiv.org/pdf/2202.03293.pdf>).
I am wondering how we should proceed to go beyond matmul and:
1. Devise a generic algorithm that does not require introducing a new
named op for every new case.
2. Automate IREE-specific integrations, without having to manually
write the C++ logic each time.
Inline assembly moves the library's job to the user and it's never a
"good" solution, especially if the users are ML/HPC researchers who only
know Python or Octave.
The best move is to the compiler side, knowing where to inject
micro-kernels in the right places, like both our projects are aiming at. To
me, this is the only acceptable stop-gap, until compilers can select the
right instructions for every case.
Alternatively, the approach followed by Intel with their TPP work is to
propagate tensor.pack / tensor.unpack operations aggressively through the
graph. By making good packing decisions on various conv, brgemm and
matmul flavors, it is my understanding that:
1. MLP models are fully folded away (into weights and the 1 input
tensor) and aligned
2. ResNet (50?) has only 7 tensor.pack and 1 tensor.unpack left.
Maybe they could shed more general light on other high order bits (
@chelini <https://github.com/chelini>).
The quick overview is that TPP sits in between previous Intel efforts to
create catch-all libraries (TBB, MKL, etc) and intrinsics / inline
assembly. We want this micro-kernel library to act as an ISA between
optimizing compilers and varied complex hardware choices (CPUs, GPUs,
accels). So far, it has worked (by hand in the paper and via compiler in
PlaidML) for key Intel, AMD and Arm CPUs and we have an effort looking into
GPUs and beyond.
What previous work has done (and why we need to replicate), is to
completely remove packing from the model except a single input packing /
output unpacking pair. Weight shapes are static throughout the model, even
if they get repacked during training, they're still the same shape for
every backward pass. So what remains is reshaping the input as it come in
and the output as it goes out, which most models already do beforehand, so
even that could be removed.
For inference, static pre-trained weights get packed at compile time and
re-written to the model file (MLIR, protobuf). For training, it will depend
on initialization. If there has been a pre-training, you need to pack only
once you start training. If they've been randomly initialized, then you
probably don't even need a re-layout.
Resnet 50's remaining packs are due to missing patterns, as they should all
go away too.
PlaidML can do all that already, but the rest of its technology is stuck in
the past. We're extracting the TPP value from it and putting it in MLIR
upstream, where it can continue to have an impact for many years to come.
…--renato
Message ID: ***@***.***>
|
Beta Was this translation helpful? Give feedback.
-
Hi everyone,
I would like to start a discussion on how to represent and manipulate data-centric transformations such as packing, in a fashion that is both cognizant of IREE requirements and general enough. By general, I want to capture at least the fact that we shouldn't want to manually write dozens of special cases in our compiler and instead have one intuitive way to cover a large number of cases.
In addition to the conciseness aspect, there are also normalization, retargetability and extensibility aspects that I unpack below (pun intended) and that I would like to socialize better to reach a common understanding. I am hoping that the vision I describe that connects the elements below forms the basis a codegen vision that you'd find compelling and future-proof.
This is related to #11821 and other posts about the special case of data tiling that IREE implements today (i.e. transforming a
linalg.matmul
into alinalg.mmt4d
to then call a library (link needed)).General Background
Data layout transformations are a well-known class of transformations aimed at obtaining high-performance by reorganizing data to match hardware data transfer characteristics.
They are important because a computer is essentially a machine that moves bits form a large storage (e.g. datacenter) to the place where it can process the data (often registers at the heart of the CPU), computes and then communicate results back.
Each interface between 2 storage media is an opportunity for crippling inefficiencies. Like in physics, such inefficiencies compound in a multiplicative fashion. The rules that govern data costs are driven by storage capacity and temporal + spatial locality (i.e. data reuse and compulsory misses are key metrics to optimize for).
To avoid order of magnitude slowdowns, proper data layout and alignment needs to be considered at all levels of the hardware hierarchy, e.g.
Specific harwardare features may allow to cut through some layers of abstraction (e.g. hardware prefetchers, write-back / write-through caches, DMA engines, CPU sockets, CUDA cp.async, NVLink, ...) or add additional layer of complexity (OS memory management / TLB, ...).
When talking about data packing, in IREE codegen, for now, I think we mostly care about handling:
memory -> L2 -> L1 -> register
case.linalg.matmul
tolinalg.mmt4d
for which we want to reuse the Ruy implementation.I know from discussions with @benvanik that he thinks a lot about all the other levels of the stack but for the sake of simplicity I'll only cast a net around
memory -> Ln -> registers
, which is a separable part of the problem. Additionally, packing solutions to thememory -> Ln -> registers
problem compose with solutions for the higher-level part of the stack.IREE Codegen
This is my attempt at characterizing ongoing efforts in IREE and making sure I, and others who have not been involved in the specifics of the IREE decision process, can get up to speed (if something is inaccurate please let me know so I can update my priors):
linalg.matmul
operations that we want to convert tolinalg.mmt4d
.SetEncoding.cpp
,MaterializeEncoding.cpp
andMaterializeEncoding.cpp
.linalg.matmul
case.Step 3. and 4. (delayed packing materialization and dispatch region creation) are a cornerstone of IREE.
They aim to provide a good enough fixed graph partitioning for all target hardware (CPU, GPU, mobile, all current and future accelerators).
As a consequence, steps 3. and 4. should never know anything about the underlying hardware.
Properties Important for Packing
Stepping back from the specific use case that we have in IREE so far, I would like to characterize the general factors influencing packing decisions, I may be missing some:
linalg.matmul
or alinalg.transposed_lhs_matmul
(assuming such a named op is introduced).lhs
,rhs
andres
of alinalg.matmul
have different layouts.linalg.matmul
requires a different packing than a square one. For example, if a dimension of amatmul
is small (e.g. small batch size of4
), packing to say16
immediately limits us to25%
of peak. Sometimes, this is sill what we may want on very specific accelerators for which no other implementation is competitive.memory -> Ln -> register case
this comprises: register size, number of registers (i.e. volume of register storage), alignment, cache line size, cache line capacity and associativity (i.e. volume of cache storage).The case of
matmul
has been heavily studied for decades and the landscape is "relatively" clear and unambiguous.For instance,
matmul_f32_f32_f32
onAVX-512
run well with anm,n,k
of size16x16x1
, iterating on as large ak
dimension that fits inL1
(see figure 16, which includes the cost of pack / unpack on the fly).Extensions to other data types and other CPU ISAs should be relatively straightforward, provided the right abstractions are added in MLIR (see the discussion on HW-specific and retargetable vector dialects in the vector dialect rationale).
Sometimes, the LLVM compiler itself may get in the way and adding inline asm may be necessary for unlocking performance (see table 3 p25).
Matmul is a crucial kernel to get right in the system and it is great we are pursuing the
mmt4d
path.I am wondering how we should proceed to go beyond matmul and:
Generalizing Packing
Recently started investigations in mapping convolutions triggered the desire for an algorithm that can more generally detect a contraction in any generic or named op (see this commit).
Here is what this looks like:
The TL;DR is that a single transformation can find a
gemm
within alinalg.generic
and pack it according to the parametersgemm_packed_sizes
andgemm_inner_dims_order
.One interesting aspect here is that this acts as a normalization step: in the 3 examples above, different input
linalg
ops are packed to the same8x16x32
form.As a side note, this also works out of the box with other ops (e.g. batch contracted op or conv_2d).
I believe this provides a level of simplification that will help us greatly with generalization.
An additional interesting observation is that this transformation is significantly more general than just the
memory -> Ln -> registers
level and I expect it will help us with distribution at various levels of the hierarchy: packed parallel (resp. reduction) iterators generalize to hierarchical parallel (resp. reduction) iterators. Sliding windows are a bit trickier and require either "ghost regions" or modular arithmetic to pack and distribute but should not be fundamentally out of character.I expect simple evolutions of this
pack_greedily
transformation will include:conv_1d
as such an operation that packs a serious punch on CPUs. This is interesting becauseconv_1d
has significantly more reuse thanmatmul
and saturates compute units even faster than gemm. This does not require a memory blowup transformation such as im2col, whose effect is to reduce memory intensity to allow mapping togemm
.gemm
,conv_1d
etc, the transform can still normalize the memory accesses of anylinalg.generic
by setting it up with guaranteed contiguous and aligned memory accesses.Strawman Usage in an E2E Codegen Flow
Here is a strawman of how we can use this to get target-independent reliable performance backstop that bottoms out on innermost tile-level of known performance, no spilling, no unnecessary loop peeling or unrolling, etc:
1
(except the "last dims involved"), map and fold to whatever level of processor HW hierarchy is available1
(except the "last dims involved")1x...x1x8x16x32
(strawman gemm case with8x16x32
packing), vectorize ithw_specific.vector
ops orinline_asm
ops.The above is a pure retargetable codegen backstop and will get to (conservatively) 5-10%, (optimistically) 10-30%of available perf reliably if step 8. is connected properly.
To get reliably higher, we will need to plug in a little larger tile sizes and plug classical transformations like interchange, promotion to fast local memory, better hoisting and pipelining.
The last k steps can also be replaced by "call a handwritten assembly kernel" when it makes sense (as I mentioned LLVM may get in the way).
Later: rewrite top-N parallel loops into a space filling curve for better cache locality and last mile optimizations.
A Few Extra Words On Generalized Packing
Additionally, the introduction of
tensor.pack
andtensor.unpack
operations provides the supporting op to implement the 3rd type oflinalg
tiling that we have been missing until now: tiling an N-D op into a 2*N-D op without introducing loops. This can also unlocklinalg
nesting and generally improve composition.Looking forward, this also connect with structured codegen beyond rectangular arrays when we can support more fancy data types.. I expect compression will require some of that but don't worry about it for now...
Request for Advice for Connecting Properly to IREE
I would like to be able to connect generalized packing-based codegen strategies for ops that have more ambiguity than "
matmul
in a size regime that is already good formmt4d
".I may not know the name of an op ahead of time, the proper packing sizes may vary based on all the factors in the "Properties Important for Packing" section.
Even in the mm4d case, we know we will want multiple different packings for different problem size regimes (e.g. see slide 153 in this older presentation for sizes up to 256^3).
This is where I would be interested in advice and help to connect this work better to IREE and avoid manual intervention in the compiler for every case.
Pluggable Type Attribute or Interface
Can we think of an extensible mechanism to connect different packing strategies to backend compilation, without having to hardcode them in C++ ?
Could we evolve the current fixed
load-bearing op name + operand packing name + hardcoded C++ mapping to magic numbers
?The information I would need to unambiguously delay the creation of packing resembles:
Now I am not particularly thrilled about the complexity of this potential Attribute / Interface but it would do the job without requiring SSA values.
In the absence of such an attribute, the following works for me but are not ideal in an IREE context:
apply packing as a preprocessing step (e.g.
mlir-opt
oriree-opt
are both fine).lower
tensor.pack
/tensor.unpack
to a mix oftensor.expand/collapse_shape
,linalg.fill
andlinalg.transpose
. This currently needs to happen because IREE does not legalizetensor.pack
/tensor.unpack
ops and fails when presented with such IR. There could be opportunities to do a graph leel rewrite (see next section).(sometimes) locally disable some heuristics of dispatch region formation to avoid interfering with the lowered
transpose
op, there are 2 cases here:i.
linalg.generic
ops that implement alinalg.transpose
seem to always be fused on inputs. This undoes the transposition part of thepacking
and defeats the purpose of the transformation. This seems to reliably not occur when using a reallinalg.transpose
. In general, I think IREE should not be so dependent on hardcoded names but there seems to be an easy mitigation.ii. whatever the form of
linalg.transpose
orlinalg.generic
implementing the transpose, it seems to always be fused into the output. This is an interesting tradeoff: (a) on one hand, this resolves transposition immediately without breaking the layout; (b) on the other hand, the iterators order is changed and the normalization property is lost (i.e. the most minor op iterator dimensions are not the ones that participate in thegemm
anymore). However it may be possible to recover it.To make things more concrete, here is a draft PR #12076 that illustrates the various point in this discussion.
One may appreciate the fact that this PR has 0 lines of C++ and is purely a composition of upstream transformations with IREE's compilation flow.
Benchmarking
A much more modest question: I have a specific use case for performance collection that I am not sure how IREE supports: I would like to run IREE on a few different compiled variants of the same dispatch region and look at their performance, sort and dig deeper into assembly. On CUDA GPUs this is easy thanks to nsight but on CPU I am unclear. I talked to @qcolombet who used tracy but it seems this is not ideal for this use case.
Atm, am I better off (a) stitching all my IRs in a single compilation unit with a dozen dispatches + dump intermediate files + manually figure it out from there or (b) is there a better suggested alternative?
Looking Beyond: Graph-Level Optimization
In the grander picture, the discussion above omits a key aspect: how to make good global packing decisions.
At the
memory -> Ln -> register case
, my understanding is that IREE is focusing on the "fuse padding with producers" approach which has been proven for matmul in Ruy. This is designed to fit within the IREE constraints of delayed materialization of packing and attributes.Alternatively, the approach followed by Intel with their TPP work is to propagate
tensor.pack
/tensor.unpack
operations aggressively through the graph. By making good packing decisions on variousconv
,brgemm
and matmul flavors, it is my understanding that:tensor.pack
and 1tensor.unpack
left.Maybe they could shed more general light on other high order bits (@chelini).
A third data point worth mentioning, if people are not familiar with the topic, is the TASO work from Stanford. This contribution features a "Graph Rewriter and Data Layout Joint Optimizer" (see Figure 1).
Lastly, Jim Demmel's communication-avoiding algorithms (i.e. distributed recomputations to avoid communications) also lurk in the shadows here, if the hardware has enough compute vs communication imbalance. It is my expectation Hopper will be close to that regime and will also be much programmable with higher-D operations that match the hardware hierarchy.
I don't want to deviate further from the topic at hand but I wanted to point out that tiling N-D to 2*N-D ops can also have large scale implications on the graph level. If these specific topics are of interest to IREE, let's start another discussion.
Thanks For Reading !
Questions, comments?
@stellaraccident @mattwalsh @benvanik @MaheshRavishankar @silvasean @jpienaar @ftynse @ThomasRaoux @qcolombet @chelini @rengolin
Beta Was this translation helpful? Give feedback.
All reactions