[DISCUSS] Narrowing with opaque buffer access #358
Also, I'd point out another problem here, which is related to the checks done during tensorization. In the above TensorCore case, the checks we want to do include C0: the shape of the wmma buffer after narrowing is divisible by 16. These checks ensure that the compiler can successfully break the whole wmma buffer into 16x16 fragments. I have no clear idea how to state such checks in TensorIntrin, if we want general TensorIntrin support.
In this particular case, there is a mapping from the 2D index (vi, vj) into a one-dimensional index space. It would be great if our schedule template did the mapping instead of relying on the tensorizer, since collapsing the two-dimensional index into a single dimension is not something the hardware provides -- we only have a one-dimensional index.

```
// Use this layout for compact compute that works with tensorization
Ccache[floordiv(i, 16)][floordiv(j, 16)][i % 16][j % 16] => C[i][j]
```
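A minimal pure-Python sketch of this mapping, just to pin down the arithmetic (the function names here are made up for illustration):

```python
def to_packed(i: int, j: int):
    """Map C[i][j] to the packed index Ccache[i//16][j//16][i%16][j%16]."""
    return i // 16, j // 16, i % 16, j % 16

def to_fragment_index(i: int, j: int, ncol: int) -> int:
    """Map (i, j) to the flat fragment index, assuming the buffer has
    ncol columns, i.e. ncol // 16 tiles per row of tiles."""
    return i // 16 * (ncol // 16) + j // 16

assert to_packed(35, 18) == (2, 1, 3, 2)
assert to_fragment_index(35, 18, 64) == 2 * (64 // 16) + 1  # == 9
```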
Example we discussed today:
I assume it's done on mainline? @Hzfengsy |
We start with the pipeline of TensorCore tensorization.
Currently, the TensorIntrin for `fill_fragment` is defined by a `wmma_fill_desc` / `wmma_fill_impl` pair, and this `wmma_fill_impl` doesn't work for a non-packed layout. The semantics of `wmma_fill_desc` is: given a starting position [vi, vj], fill C[vi : vi+16, vj : vj+16] with 0. Hence a semantically equivalent `wmma_fill_impl` should compute the fragment index from that starting position, as in the sketch below. Note that the 5th argument is the index into the warp buffer. In the high-level programming model, we operate on 16x16 subregions of a complete large buffer. But in the low-level programming model, the compiler will cut these warp pieces into separate 16x16 warp buffers, hence we need an index to designate which piece we are operating on.
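A sketch of what such a `wmma_fill_impl` could look like, written in present-day TVMScript for concreteness. The 64x64 extent, buffer scope, and loop structure are assumptions for illustration, not the exact code from this repo:

```python
from tvm.script import tir as T

@T.prim_func
def wmma_fill_impl_sketch(c: T.handle) -> None:
    # A 64x64 warp-level accumulator, i.e. a 4x4 grid of 16x16 fragments.
    C = T.match_buffer(c, (64, 64), "float32", scope="wmma.accumulator", offset_factor=16)
    for io, jo in T.grid(4, 4):
        vi: T.int32 = io * 16  # starting row of this 16x16 tile
        vj: T.int32 = jo * 16  # starting column of this 16x16 tile
        T.evaluate(
            T.tvm_fill_fragment(
                C.data, 16, 16, 16,
                # 5th argument: index of this fragment within the warp buffer,
                # i.e. vi // 16 * (C.shape[-1] // 16) + vj // 16
                vi // 16 * (64 // 16) + vj // 16,
                T.float32(0),
                dtype="handle",
            )
        )
```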
This brings trouble for narrowing. Narrowing will change the shape of C, and hence will require recalculating the index argument of `tir.tvm_fill_fragment`. The problem is that we don't know how to rewrite the expression `vi // 16 * C.shape[-1] // 16 + vj // 16`. Suppose the starting position of the narrowed buffer C' is [i0, j0]; the correct expression after the rewrite should be `(vi - i0) // 16 * C'.shape[-1] // 16 + (vj - j0) // 16`.
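To make this concrete, here is a small plain-Python check with assumed sizes: C is 64x128 and C' is the 32x64 subregion starting at [i0, j0] = [32, 64]. It shows the index really does change under narrowing:

```python
def frag_index(vi, vj, ncol):
    # Fragment index of the 16x16 tile with origin (vi, vj),
    # in a buffer with ncol columns (ncol // 16 tiles per row).
    return vi // 16 * (ncol // 16) + vj // 16

i0, j0 = 32, 64        # origin of the narrowed buffer C' inside C
vi, vj = 48, 80        # origin of one tile, inside the narrowed region

before = frag_index(vi, vj, 128)           # index into the full buffer C
after = frag_index(vi - i0, vj - j0, 64)   # index into the narrowed buffer C'

print(before, after)  # 29 5 -- both the offsets and the shape factor must be rewritten
```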
I propose two methods for this problem.
M0. Using MatchSubregion
#130 proposes MatchSubregion, which aims to deal with the same problem, but it mainly handles opaque accesses to buffer fields like `C.elem_offset`. vi and vj are also variables that need to be recalculated against the new starting point. We need to give the compiler hints about which parts of the body will be affected by narrowing. We can use new TIR ops like `tir.relative(vi, 0)` and `tir.relative(vj, 1)` (a combined sketch of these spellings follows M1.2 below).
M1. Using new tir Ops
This method directly utilizes new ops. We can do this in two different ways.
M1.1 Introduce `tir.tile_index(vi, vj)` to directly represent the whole expression.

M1.2 Similar to M0, we introduce `tir.relative(vi, 0)`, but we use `tir.get_shape_dim(C, dim=-1)` in place of `C.shape`.
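For concreteness, the three spellings of the index expression would look roughly as follows. All of these ops are proposals from this thread; none exist in TIR, so this is pseudocode:

```python
# M0: mark vi and vj as relative to the narrowed buffer's start in dims 0 and 1;
# C.shape itself is left for MatchSubregion to handle.
idx_m0 = tir.relative(vi, 0) // 16 * (C.shape[-1] // 16) + tir.relative(vj, 1) // 16

# M1.1: one opaque op stands for the entire index computation.
idx_m11 = tir.tile_index(vi, vj)

# M1.2: like M0, but the shape is also queried through an op, so narrowing
# can rewrite it without pattern-matching on C.shape.
idx_m12 = (
    tir.relative(vi, 0) // 16 * (tir.get_shape_dim(C, dim=-1) // 16)
    + tir.relative(vj, 1) // 16
)
```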
.cc @tqchen @vinx13 @Hzfengsy @junrushao1994