[DISCUSS] Narrowing with opaque buffer access #358
Also, I'd point out another problem here, which is related to the checks done during tensorization. In the above TensorCore case, the checks we want to do include C0: the shape of the wmma buffer after narrowing is divisible by 16. These checks ensure that the compiler can successfully break the whole wmma buffer into 16x16 fragments. I have no clear idea how to state such checks in TensorIntrin, if we want general TensorIntrin support.
In this particular case, there is a mapping from the 2D index (vi, vj) into a one-dimensional index space. It would be great if our schedule template did the mapping instead of relying on the tensorizer, since collapsing the two-dimensional index into a single dimension is not something the hardware provides -- we only have a one-dimensional index.

```
// Use this layout for compact compute that works with tensorization
Ccache[floordiv(i, 16)][floordiv(j, 16)][i % 16][j % 16] => C[i][j]
```
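A minimal pure-Python sketch of this mapping, just to pin down the arithmetic (the function names here are made up for illustration):

```python
def to_packed(i: int, j: int):
    """Map C[i][j] to the packed index Ccache[i//16][j//16][i%16][j%16]."""
    return i // 16, j // 16, i % 16, j % 16

def to_fragment_index(i: int, j: int, ncol: int) -> int:
    """Map (i, j) to the flat fragment index, assuming the buffer has
    ncol columns, i.e. ncol // 16 tiles per row of tiles."""
    return i // 16 * (ncol // 16) + j // 16

assert to_packed(35, 18) == (2, 1, 3, 2)
assert to_fragment_index(35, 18, 64) == 2 * (64 // 16) + 1  # == 9
```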
Example we discussed today:
I assume it's done on mainline? @Hzfengsy |
We start with the pipeline of TensorCore tensorization.
Currently, the TensorIntrin for `fill_fragment` is defined by a `wmma_fill_desc` / `wmma_fill_impl` pair, and this `wmma_fill_impl` doesn't work for a non-packed layout. The semantics of `wmma_fill_desc` is: given a starting position [vi, vj], fill C[vi : vi+16, vj : vj+16] with 0. Hence a semantically equivalent `wmma_fill_impl` should compute the fragment index from that starting position, as in the sketch below. Note that the 5th argument is the index into the warp buffer. In the high-level programming model, we operate on 16x16 subregions of a complete large buffer. But in the low-level programming model, the compiler will cut these warp pieces into separate 16x16 warp buffers, hence we need an index to designate which piece we are operating on.
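A sketch of what such a `wmma_fill_impl` could look like, written in present-day TVMScript for concreteness. The 64x64 extent, buffer scope, and loop structure are assumptions for illustration, not the exact code from this repo:

```python
from tvm.script import tir as T

@T.prim_func
def wmma_fill_impl_sketch(c: T.handle) -> None:
    # A 64x64 warp-level accumulator, i.e. a 4x4 grid of 16x16 fragments.
    C = T.match_buffer(c, (64, 64), "float32", scope="wmma.accumulator", offset_factor=16)
    for io, jo in T.grid(4, 4):
        vi: T.int32 = io * 16  # starting row of this 16x16 tile
        vj: T.int32 = jo * 16  # starting column of this 16x16 tile
        T.evaluate(
            T.tvm_fill_fragment(
                C.data, 16, 16, 16,
                # 5th argument: index of this fragment within the warp buffer,
                # i.e. vi // 16 * (C.shape[-1] // 16) + vj // 16
                vi // 16 * (64 // 16) + vj // 16,
                T.float32(0),
                dtype="handle",
            )
        )
```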
This brings trouble for narrowing. Narrowing will change the shape of C, and hence will require recalculating the index argument of `tir.tvm_fill_fragment`. The problem is that we don't know how to rewrite the expression `vi // 16 * C.shape[-1] // 16 + vj // 16`. Suppose the starting position of the narrowed buffer C' is [i0, j0]; the correct expression after the rewrite should be `(vi - i0) // 16 * C'.shape[-1] // 16 + (vj - j0) // 16`.
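To make this concrete, here is a small plain-Python check with assumed sizes: C is 64x128 and C' is the 32x64 subregion starting at [i0, j0] = [32, 64]. It shows the index really does change under narrowing:

```python
def frag_index(vi, vj, ncol):
    # Fragment index of the 16x16 tile with origin (vi, vj),
    # in a buffer with ncol columns (ncol // 16 tiles per row).
    return vi // 16 * (ncol // 16) + vj // 16

i0, j0 = 32, 64        # origin of the narrowed buffer C' inside C
vi, vj = 48, 80        # origin of one tile, inside the narrowed region

before = frag_index(vi, vj, 128)           # index into the full buffer C
after = frag_index(vi - i0, vj - j0, 64)   # index into the narrowed buffer C'

print(before, after)  # 29 5 -- both the offsets and the shape factor must be rewritten
```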
I propose two methods for this problem.
M0. Using MatchSubregion
#130 proposes MatchSubregion, which aims to deal with the same problem, but it mainly handles opaque accesses to buffer fields like `C.elem_offset`. vi and vj are also variables that need to be recalculated against the new starting point. We need to give the compiler hints about which parts of the body will be affected by narrowing. We can use new TIR ops like `tir.relative(vi, 0)` and `tir.relative(vj, 1)` (a combined sketch of these spellings follows M1.2 below).
M1. Using new tir Ops
This method directly utilizes new ops. We can do this in two different ways.
M1.1 Introduce `tir.tile_index(vi, vj)` to directly represent the whole expression.

M1.2 Similar to M0, we introduce `tir.relative(vi, 0)`, but we use `tir.get_shape_dim(C, dim=-1)` in place of `C.shape`.
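For concreteness, the three spellings of the index expression would look roughly as follows. All of these ops are proposals from this thread; none exist in TIR, so this is pseudocode:

```python
# M0: mark vi and vj as relative to the narrowed buffer's start in dims 0 and 1;
# C.shape itself is left for MatchSubregion to handle.
idx_m0 = tir.relative(vi, 0) // 16 * (C.shape[-1] // 16) + tir.relative(vj, 1) // 16

# M1.1: one opaque op stands for the entire index computation.
idx_m11 = tir.tile_index(vi, vj)

# M1.2: like M0, but the shape is also queried through an op, so narrowing
# can rewrite it without pattern-matching on C.shape.
idx_m12 = (
    tir.relative(vi, 0) // 16 * (tir.get_shape_dim(C, dim=-1) // 16)
    + tir.relative(vj, 1) // 16
)
```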
.cc @tqchen @vinx13 @Hzfengsy @junrushao1994