
[DISCUSS] Thoughts on Correctness Validation #45

Closed
Hzfengsy opened this issue Mar 13, 2020 · 8 comments

Comments

@Hzfengsy
Member

Hzfengsy commented Mar 13, 2020

After frequent discussions with @spectrometerHBH, we found that there are still some problems with correctness validation.

Previously, we assumed that a block gets correct input as long as the dependency is satisfied (e.g. the consumer block executes after the producer block). However, that is not always true, especially when the producer block updates its output tensor. Here is an example:

for i0 = 0 to 16
  for j0 = 0 to 16
    block(vj=j0, ..., name="A")
      A[j0] = A[j0] + 1

for i1 = 0 to 16
  for j1 = 0 to 16
    block(...)
      B[i1][j1] = A[j1]

In this case, we cannot compute_at block B under loop j0 even though the dependency is satisfied. Here is the wrong code after compute_at:

for i0 = 0 to 16
  for j0 = 0 to 16
    A[j0] = A[j0] + 1
    B[i0][j0] = A[j0] 

The problem arises because the producer block has not yet produced its final, correct output while the loop is still running. The same problem also appears in reorder with branches. Of course, we are not trying to support such cases, but we need to exclude them from the set of legal schedules, and that is not easy either.
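
For concreteness, here is a small plain-Python/NumPy reproduction (not TIR; the extents are taken from the example above) of how the two schedules diverge:

import numpy as np

# Original schedule: A is fully produced before B consumes it.
A = np.zeros(16)
for i0 in range(16):
    for j0 in range(16):
        A[j0] = A[j0] + 1
B_ref = np.empty((16, 16))
for i1 in range(16):
    for j1 in range(16):
        B_ref[i1][j1] = A[j1]           # every row sees A[j1] == 16

# After the illegal compute_at: B reads a partially updated A.
A = np.zeros(16)
B_fused = np.empty((16, 16))
for i0 in range(16):
    for j0 in range(16):
        A[j0] = A[j0] + 1
        B_fused[i0][j0] = A[j0]         # row i0 sees A[j0] == i0 + 1

assert not np.allclose(B_ref, B_fused)  # the two schedules disagree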

Proposal 0

The most radical solution is to forbid scheduling for IR that comes from the hybrid script. It would solve every correctness problem, since we would only support scheduling for te.compute. It is drastic, and we will not use it as long as we have any other way; it would be our last resort.

Proposal 1

The key point is that the producer cannot provide valid results. We would like to make sure all producer blocks are complete. Here is my proposal:

  • Change the definition of the complete block:
    • It is the only producer of every output tensor
    • The block cannot read from any tensor it writes
    • Every block_var related to the output region must be data_par (not necessarily all block_vars)
  • All producer blocks must be complete during compute_at;
  • All blocks to be decomposed must be complete during reorder;
  • Introduce a reduction operator such as !+= so that a reduction (init + update) becomes a complete block.

With this definition, a complete block is equivalent to a current TVM stage, even for a reduction block. We can then forbid every risky operation that is not allowed under the completeness restriction.
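
To make the intended rule concrete, here is a rough Python sketch of the completeness check; the BlockInfo structure and its fields are made up for illustration and are not the actual TIR classes.

from dataclasses import dataclass, field

@dataclass
class BlockInfo:                 # hypothetical stand-in for a block's signature
    name: str
    reads: set                   # names of tensors the block reads
    writes: set                  # names of tensors the block writes
    # iter_types of the block_vars related to the output region
    output_var_types: list = field(default_factory=list)

def is_complete(block: BlockInfo, scope_blocks: list) -> bool:
    # 1. the block is the only producer of every tensor it writes
    for other in scope_blocks:
        if other is not block and other.writes & block.writes:
            return False
    # 2. the block does not read from any tensor it writes
    if block.reads & block.writes:
        return False
    # 3. every block_var related to the output region is data_par
    return all(t == "data_par" for t in block.output_var_types)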

Proposal 2

An insight we can gain from the counterexample above is that we haven't come up with a validation mechanism for a block and its surrounding loops. In fact, block A's surrounding loops i0, j0 are not legal, for the reasons stated below.

The spirit of TIR is to express information in the block and generate loops outside the block to satisfy its constraints. Besides the range information, we also have to check the binding function, i.e. (vi, vj, ...) = f(i, j, ...). The range only tells us which subset of the block's instance space may be run by the outer loops; it cannot tell us how many times a specific instance will be touched (maybe 0, maybe more than 1). In the counterexample above, each instance of block "A" is run many times because of i0. Hence, if we pose no further constraints, the block information is incomplete.

One reasonable constraint on the binding function f is that it builds a 1-1 correspondence between the space of (i, j, k, ...) and the space of (vi, vj, vk, ...), i.e. the loop iteration together with the binding function iterates over every instance in the block var space exactly once. The previous complete block definition plus this 1-1 correspondence constraint make it safe to compute_at.

But checking this constraint is difficult. split and fuse make the binding function rather complicated, and it is hard to come up with a simple pattern that is compatible with the current bindings we have to support and is, at the same time, provably a 1-1 correspondence.

One algorithm I came up with is to consider only the following bindings legal:

1.  vi = c1*i + b1, vj = c2*j + b2, vk = c3*k + b3, ... (one loop_var binds one block_var linearly)
2.  if f is a legal binding and g is the binding after applying `split` to f, then g is legal
3.  if f is a legal binding and g is the binding after applying `fuse` to f, then g is legal

Given a binding, we try to reverse the split and fuse effects to recover vi = c1*i + b1, vj = c2*j + b2, vk = c3*k + b3. If we fail, the binding is not legal.
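
As a starting point, here is a sketch (using sympy, with made-up helper names) of checking the base case, rule 1; actually reversing split and fuse, i.e. rewriting patterns like io*factor + ii back into a single fresh loop var before running this check, is omitted.

import sympy as sp

def is_simple_affine_binding(bindings, loop_vars):
    """bindings: dict mapping each block_var name to a sympy expression in loop_vars."""
    used = set()
    for expr in bindings.values():
        free = expr.free_symbols
        # exactly one loop_var per block_var, and no loop_var used twice
        if len(free) != 1 or not free <= set(loop_vars) or free & used:
            return False
        used |= free
        (v,) = free
        coeff = expr.diff(v)
        # vi = c*i + b with a constant, non-zero c is 1-1 over the loop range
        if not coeff.is_constant() or coeff == 0:
            return False
    return True

i, j = sp.symbols("i j")
print(is_simple_affine_binding({"vi": 2 * i + 1, "vj": j}, [i, j]))   # True
print(is_simple_affine_binding({"vi": i + j, "vj": j}, [i, j]))       # False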

Between these two (excluding Proposal 0), we both prefer Proposal 1 for now.

@tqchen
Contributor

tqchen commented Mar 13, 2020

The root of the problem is still about trying to run compute_at on a block that is not a complete provider. Proposal 1 sounds good.

Note that we can go further to forbid reduction in a complete block and force the user to blockize. The init-reduction operator might then be introduced as a sugar that is equivalent to (blockize, compute_at, unblockize).

@spectrometerHBH
Collaborator

spectrometerHBH commented Mar 13, 2020

I've updated proposal 2. I will post my proofs for proposal 1 tomorrow.

@spectrometerHBH
Collaborator

spectrometerHBH commented Mar 14, 2020

Proofs for current primitives

1. Definitions

1. iter_var

iter_var is a tuple of (min, extent, iter_type), which denotes a loop that ranges over [min, min+extent). iter_type poses dependency constraints on the running instances under the loop.

2. iter_type

There are currently 3 kinds of iter_type: data_par, reduce, and ordered.
Suppose we have a loop as shown below.

for i in S
  T[i]
  • data_par: For any i≠j in S, T[i] and T[j] are reorderable and parallelizable. There is no RAW/WAW/WAR dependency between these instances.
  • reduce: For any i≠j in S, T[i] and T[j] are reorderable but not parallelizable. There are RAW/WAW/WAR dependencies between these instances.
  • ordered: For any i≠j in S, T[i] and T[j] are neither reorderable nor parallelizable. There are RAW/WAW/WAR dependencies between these instances.
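
A small plain-Python illustration of the difference (an assumed example, not TIR): a data_par body gives the same result under any iteration order, while an ordered body, here a prefix dependency, does not.

def data_par_body(T, i):
    T[i] = A[i] * 2                         # instances write disjoint positions

def ordered_body(T, i):
    T[i] = T[i - 1] + 1 if i > 0 else 0     # instance i reads instance i-1's write

def run(body, order):
    T = [0] * 8
    for i in order:
        body(T, i)
    return T

A = list(range(8))
assert run(data_par_body, range(8)) == run(data_par_body, reversed(range(8)))
assert run(ordered_body, range(8)) != run(ordered_body, reversed(range(8)))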

3. block_var

block_var is of type iter_var.
Suppose we have a block with block_vars declared as below

block(block_var_info={v1, v2, ... vm})
  body[v1, v2, ... vm] #running instance parameterized by block_vars

Then

for v1 in [min_1, min_1+extent_1) # iter_type_1
  for v2 in [min_2, min_2+extent_2) # iter_type_2
    ...
      for vm in [min_m, min_m+extent_m) # iter_type_m
         body[v1, v2, ... vm] #running instance parameterized by block_vars

is the default legal set of surrounding loops, which pose constraints on body[v1, v2, ... vm].


4. Validatable loop binding

We expect the actual loops binding this block to iterate these running instances satisfying the constraints.

However, it is difficult to check an arbitrary set of loop bindings, so we define validatable loop bindings here.

A set of binding expressions is validatable if

  • vi=i, vj=j, vk=k, ... (one loop_var binds one block_var exactly)
  • if f is a validatable binding and g is the binding after applying split to f, then g is validatable
  • if f is a validatable binding and g is the binding after applying fuse to f, then g is validatable

Note: A non-validatable loop binding will also work in the IR, but it may limit the schedulability of the block.

5. dependency between blocks

Dependencies are defined between blocks under the same block scope.

Consider Block B and Block C under the same scope,

B is C's predecessor if

  • B appears before C in the DFS order, and
  • (RAW predecessor) C reads an output tensor that B writes, or
  • (WAW predecessor) C writes an output tensor that B writes, or
  • (WAR predecessor) C writes a tensor that B reads

The definition of successor is similar.

Note: WAR dependencies are not allowed in the IR for now

6. dominant block

A block B is dominant if

  • It is the only producer of every output tensor
  • (Region Cover Property) Suppose B is a RAW predecessor of C, Loop k is the LCA of B and C, then B's output region covers C's input region under Loop k

Properties of dominant block

    1. No matter when and where a consumer reads the output of a dominant block after the block's full execution, the consumer will see the same input.
      Proof: Otherwise, there must exist some other block that modifies the output of the dominant block.
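
A rough sketch (with an assumed region representation, one (min, extent) pair per tensor dimension) of the Region Cover Property check under the LCA loop:

def covers(producer_out_region, consumer_in_region):
    """True if the producer's output region covers the consumer's input region."""
    for (p_min, p_ext), (c_min, c_ext) in zip(producer_out_region, consumer_in_region):
        if not (p_min <= c_min and c_min + c_ext <= p_min + p_ext):
            return False
    return True

# producer writes A[0..15], consumer reads A[2..9] under the LCA loop: covered
print(covers([(0, 16)], [(2, 8)]))   # True
# producer writes A[0..7] only, but the consumer reads up to A[9]: not covered
print(covers([(0, 8)], [(2, 8)]))    # False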

7. complete block

A block is a complete block if

  • It is dominant

  • The block cannot read from any tensor it writes

  • All block_vars must be data_par (Note: stronger than Proposal 1)

Properties of complete block:

    1. Different running instances in the block var space write different positions of the output tensor.
      Proof: Otherwise there would be a WAW dependency between instances, which violates the definition of data_par.
    2. We can iterate the whole instance space in arbitrary order.
      Proof: Clear from the definition of data_par.
    3. Given the input, no matter how many times we iterate the whole block instance space, we will get the same output.
      Proof: A tensor is either read or written, but not both, so any tensor being read is invariant during execution. As for a tensor being written, each specific position is written by one instance using a fixed expression (involving tensors being read and literals).
    4. No matter when and where a consumer reads the output of a dominant block after the block's full execution, the consumer will see the same input.
      Proof: Inherited from the property of the dominant block.
    5. (Decomposable Data Flow) If a loop k has only complete blocks below it, then loop k is fully decomposable.

      Proof: We can prove by induction on the number of blocks below.


      If there is only one block below, then we are done.

      If there are N blocks below, we number them in DFS order as block1, block2, ..., blockN. We show that block1 is decomposable from loop k.

      • WAR is not allowed + block1 will not read its own output

        => block1's input is invariant during the whole execution of loop k

      • What's more, block1 is complete

        => Whenever block1 writes an output position, it will write the same value

      • block1 is complete => Region cover property
        => Whenever block i (i > 1) reads the output tensor of block1, the position has already been written by block1

8. Reduction Block

A block is a reduction block if

  • It is dominant

  • All its block_vars are data_par or reduce, but the block_vars related to the output region must be data_par

  • All writes to the output positions are incremental (Note: we can decide later how to declare this special kind of write)

Properties of reduction block:

    1. A specific output position never reads the value of any other output position. In other words, we can only write a tensor position using its own value and other values that are invariant during execution.
      Proof: Otherwise it would violate the definition of data_par.
    2. Running instances with the same values for the data_par block vars but different values for the reduce block vars write the same output position.
      Proof: Clear from the definition.
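
A plain-NumPy matmul (an assumed example) illustrating the definition: vi and vj are data_par and pick the output position, vk is reduce, the only write is incremental, and instances that differ only in vk all update the same C[vi, vj].

import numpy as np

N = 4
A, B = np.random.rand(N, N), np.random.rand(N, N)
C = np.zeros((N, N))
for vi in range(N):            # data_par
    for vj in range(N):        # data_par
        for vk in range(N):    # reduce
            C[vi, vj] += A[vi, vk] * B[vk, vj]   # incremental write only
assert np.allclose(C, A @ B)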

Primitives

1. split

Mutate:

  • i -> i_1 * factor + i_2, introducing a predicate when necessary.
    Proof: The iteration order and iteration range of the block var space are invariant.
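
A quick plain-Python check (with assumed extent and factor) that the rewrite visits exactly the same indices in the same order, with the predicate guarding the tail when the extent is not divisible by the factor:

extent, factor = 10, 4
original = list(range(extent))

split = []
outer = (extent + factor - 1) // factor     # ceil(extent / factor)
for i1 in range(outer):
    for i2 in range(factor):
        i = i1 * factor + i2
        if i < extent:                      # the introduced predicate
            split.append(i)

assert original == split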

2. fuse

Check:

  • check that the outer loop is single-branch

Mutate:

  • (i1, i2) -> i, i1=i / factor, i2=i % factor

Proof: The iteration order of the block var space under the inner loop is invariant.
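
The corresponding plain-Python check for fuse (with assumed extents; here the factor is the inner loop's extent): recovering (i1, i2) from the fused index preserves the iteration order.

extent1, extent2 = 3, 5                     # outer and inner extents
original = [(i1, i2) for i1 in range(extent1) for i2 in range(extent2)]

fused = [(i // extent2, i % extent2) for i in range(extent1 * extent2)]

assert original == fused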

3. reorder

Check:

  • check the loops are in the same line and are single-branch
  • the block below has all its block_vars as data_par or reduce.

Proof: Reordering the outer loops changes the iteration order of the block instance space, but since all block_vars are data_par or reduce, doing so causes no problem.

Note: This works naturally for reduction blocks, but it depends on how we choose the zeroing position; see the comment below.

4. compute_at(input_block, input_loop)

First set of Checks:

    1. check input_block is complete
    2. check input_block, its predecessors, RAW successors are not in the same loop tree
    3. check dependency: all input_block's RAW successors are under input_loop
    4. check input_block is not the output block of the current scope

Mutate:

    1. generate the loops that produce what input_block's successors need

Proof of First set of Checks:
i => input_block only has RAW successors and RAW predecessors
i => No block will write the output of input_block except the input_block itself
i => No block will write the input of input_block except the RAW predecessors
i + ii => input_block will read the same input as before.
i + ii + iii + v + region cover property => consumers of input_block will read the same input as before.
iv => It is safe for the input_block to produce more than before (to avoid cross-scope problems).
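
A rough Python sketch of the first set of checks; the scope queries are passed in as methods on a hypothetical scope object, so none of these names correspond to the real TIR API.

def can_compute_at(input_block, input_loop, scope):
    # i. the block being moved must be complete
    if not scope.is_complete(input_block):
        return False
    # ii. the block, its predecessors, and its RAW successors must not share a loop tree
    related = scope.predecessors(input_block) + scope.raw_successors(input_block)
    if any(scope.same_loop_tree(input_block, b) for b in related):
        return False
    # iii. every RAW successor must already be under input_loop
    if not all(scope.is_under(b, input_loop) for b in scope.raw_successors(input_block)):
        return False
    # iv. the block must not be the output block of the current scope
    return not scope.is_output_block(input_block)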

Second set of Checks:

    1. check input_block is complete
    2. check the loop tree of input_block contains only complete blocks
    3. check dependency: all input_block's RAW successors are under input_loop
    4. check input_block is not the output block of the current scope

Proof of Second set of Checks:
Note that the second set of checks differs from the first set only in check ii.
In the beginning, the predecessors, input_block, and the successors may be in the same loop tree. But if we apply property 5 of the complete block, the loop tree of input_block is fully decomposable; then we can apply the first set of checks.

5. rfactor(input_block, reduce_loop, factor)

Check:

  • check the loops under reduce_loop are all single-branch
  • check input_block is a reduction block
  • check reduce_loop binds exactly one reduce block_var, i.e. vi=i

Mutate:

  • it may be easier to understand the mutation by showing it in code:
reduce_loop(i)
    ...
      block({vi=i, vj, vk, ...})
        Inc(A[..., ..., ...], Expr) # note that the indexes of A are all data_par block_var related
==>
loop_outer=i0
  loop_inner=i1
    ...
       block({vi=i0*factor+i1, vj, vk, ...})
          Inc(BF[vj, vk, ...,], Expr) # vj, vk, ... are all data_par block vars
loop_outer=i0
   ...
     block({vi=i0, vj, vk, ... })
       Inc(A[..., ..., ...], BF[vj, vk, ]) # vj, vk, ... are all data_par block vars

Proof:
This is clearly correct, no matter where we decide to generate the zeroing block.
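
A small NumPy check (assumed shapes; the rfactor buffer here carries the factored axis explicitly) that the rewrite computes the same result: partial sums are first accumulated into BF over the inner part of the split reduce loop, then BF is reduced over the outer part.

import numpy as np

N, K, factor = 4, 8, 4
A, B = np.random.rand(N, K), np.random.rand(K, N)
C_ref = A @ B                              # original: C[vj, vk] = sum_i A[vj, i] * B[i, vk]

outer = K // factor
BF = np.zeros((outer, N, N))               # rfactor buffer, indexed by the outer loop var
for i0 in range(outer):
    for i1 in range(factor):
        i = i0 * factor + i1
        for vj in range(N):
            for vk in range(N):
                BF[i0, vj, vk] += A[vj, i] * B[i, vk]

C = np.zeros((N, N))
for i0 in range(outer):                    # final reduction over the factored axis
    C += BF[i0]

assert np.allclose(C, C_ref)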

6. unroll

Unroll is trivial.

7. vectorize/parallel/bind(input_loop)

vectorize, parallel, and bind are similar primitives; the core of their constraint is that the loop must be parallelizable.

Check:

    1. check the block under it is a complete block or a reduction block
    2. check input_loop is bound and only bound to data_par block_vars
    3. check the loops of reduction blocks are validatable

Mutate:

    1. set Annotation on the loop

Proof:

We prove this by showing that there is no data flow between input_loop=i and input_loop=j, and we show this by induction on the number of blocks.

If there is only one block below:

  • The block is complete. All its instances are independent of each other.
  • The block is a reduction. input_loop is bound and only bound to data_par block_vars + the loops of the reduction block are validatable => instances under input_loop=i write different positions from instances under input_loop=j, hence they are independent.

If a new block comes in, consider its instances under input_loop=i.

  • If the producer is complete: producer instances under input_loop=j may write positions that the new instances under input_loop=i read, but the new instances will read the same value as the one produced by the producer under input_loop=i, since the producer is complete.
  • If the producer is a reduction: producer instances under input_loop=j will never write positions that the new instances under input_loop=i may read. Hence there is no data flow.

8. cache_read

Ongoing.

9. cache_write

Ongoing.

10. blockize

Ongoing.

11. tensorize

Ongoing.

@spectrometerHBH
Collaborator

spectrometerHBH commented Mar 14, 2020

It seems to me that Proposal 2 is somewhat independent of the current issues. We may find a validatable way to transform the TIR without needing the 1-1 correspondence. But I think it is a subtle problem of how we understand blocks.

Would love to hear your comments, and it would be great if you could have a look at the current algorithms (checks, mutations) and proofs. @tqchen @Hzfengsy

@spectrometerHBH
Collaborator

spectrometerHBH commented Mar 14, 2020

Btw, if we aim to automatically detect iter types for block vars in the future, we may encounter the 1-1 correspondence problem again (e.g. we want to ensure that the write positions are mutually different for different running instances). But it seems to be simpler than the binding-function case. We can support only linearly mapped buffer access patterns for now.

@spectrometerHBH
Collaborator

spectrometerHBH commented Mar 15, 2020

Default zeroing position of reduction block

for i in range(10):
  for j in range(10):
    for k in range(10):
      Inc(C[i,j], A[i,k]*B[k,j])

As said, a reduction block will zero its output region, but we have several choices for the zeroing position.

  • before the top
for i in range(10): # zeroing position
  for j in range(10):
    C[i,j]=0
for i in range(10):
  for j in range(10):
    for k in range(10):
      Inc(C[i,j], A[i,k]*B[k,j])
  • under the parent of the highest loop related to a reduce block var
for i in range(10):
  for j in range(10):
    C[i,j] = 0 # zeroing position
    for k in range(10):
      Inc(C[i,j], A[i,k]*B[k,j])

Here are several points to consider.
1. Whether the zeroing position will be influenced by reordering.

Clearly, option 1 will not be influenced by reordering; option 2 will be, if the bottom of the reorder is the original zeroing position. This creates a problem for dependencies. For example,

for i in range(10):
  for j in range(10):
    for k in range(10):
      block():
        ...
      block():
        Inc(C[i,j], A[i,k]*B[k,j])

If we apply reorder(k, i), then the equivalent program transformation is

for i in range(10):
  for j in range(10):
    for k in range(10):
      block(name="foo"):
        ...
      block(name="zeroC"):
        C[i,j]=0
      block():
        Inc(C[i,j], A[i,k]*B[k,j])

==>

for k in range(10):
  block(name="zeroC"):
      C[i,j]=0
  for i in range(10):
    for j in range(10):
      block(name="foo"):
        ...
      block():
        Inc(C[i,j], A[i,k]*B[k,j])

The execution order between zeroC and foo is changed, which brings potential dependency risks.

@spectrometerHBH
Collaborator

Complete is not that complete

for i0
  block(name="A")
    A[i0] = B[i0] + 1
  block(name="C")
    C[i0] = A[i0+1]
    
for i1
   D[i1] = C[i1]

Consider the code snippet listed above. We can see that even if block A is complete, we can't guarantee whether the "consumer" C actually consumes A's output or not.
Another example shows that the consumer may appear earlier in the AST:

for i0
  block(name="A")
    A[i0] = B[i0] + 1
  block(name="B")
    B[i0] = A[i0] + 1

Possible Solution: one-way fine-grained dataflow check

Suppose a loop tree has several blocks at its leaves. We can sort them in DFS order as B1, B2, ..., Bn.

If

  • All the blocks are complete
  • Bi doesn't read the buffers that Bi+1, Bi+2, ..., Bn will write
  • Suppose Bi reads Bj's output buffer (j < i) and Loop k is the LCA of Bi and Bj; then Bj's output region covers Bi's input region under Loop k

Then we are safe to decompose the whole loop tree.
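
A rough Python sketch of this check over blocks listed in DFS order; the block records and field names are hypothetical, and the per-pair region cover check is elided.

def one_way_dataflow_ok(blocks):
    """blocks: DFS-ordered list of records with .is_complete and .reads/.writes buffer-name sets."""
    for idx, blk in enumerate(blocks):
        # every block must be complete
        if not blk.is_complete:
            return False
        # blk must not read any buffer that a later block writes
        later_writes = set().union(*[b.writes for b in blocks[idx + 1:]])
        if blk.reads & later_writes:
            return False
        # for each earlier producer Bj that blk reads from, Bj's output region
        # must cover blk's input region under their LCA loop (check elided here)
    return True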

@spectrometerHBH
Collaborator

spectrometerHBH commented Mar 24, 2020

one-way fine-grained dataflow check

As said above, we want to check

  • Suppose Bi reads Bj's output buffer (j < i) and Loop k is the LCA of Bi and Bj; then Bj's output region covers Bi's input region under Loop k

This works when the block writes every position in its output region. A relaxed output region will lead to errors.

For example, a loop i has two blocks below:

for i
  Block1
  Block2

Block 1 reads A and writes B. Block 2 reads B and writes C.

When i = 0, Block1 writes B[1,3,5,7,9], Block2 reads B[1,2,3,4,5]
When i = 1, Block1 writes B[2,4,6,8,10], Block2 reads B[1,2,3,4,5]

Then i is not parallelizable.

If a block doesn't write every position of its output region, e.g. A[0,0] is in the output region but is never written, then it is equivalent to the block reading A[0,0] and writing the same value back to A[0,0], which means the block reads its own output tensor.
