[DISCUSS] Thoughts on Correctness Validation #45
The root of the problem is still about trying to run compute_at on a block that is not a complete provider. Proposal 1 sounds good. Note that we can go further and forbid reduction in a complete block, forcing the user to blockize. The init-reduction operator might introduce a sugar that is equivalent to (blockize, compute_at, unblockize).
I've updated proposal 2. I will post my proofs for proposal 1 tomorrow.
It seems to me that Proposal 2 is somewhat independent of the current issues. We may find a verifiable way to transform the TIR without needing the 1-1 correspondence, but I think it is a subtle problem of how we understand blocks. Would love to hear your comments, and it would be great if you could have a look at the current algorithms (check, mutation) and proofs. @tqchen @Hzfengsy
Btw, if we aim to automatically detect iter types for block vars in the future, we may encounter the 1-1 correspondence problem again (e.g. we want to ensure that the write positions are mutually different for different running instances). But it seems simpler than the binding-function problem; we can support only linearly mapped buffer access patterns for now.
Default zeroing position of reduction block
As said, a reduction block will zero its output region, but we have several choices for the zeroing position.
There are several points to consider. Clearly, choice 1 is not influenced by reordering, while choice 2 is if the bottom of the reorder is the original zeroing position; this brings a problem to the dependency analysis. For example, if we apply reorder(k, i), the equivalent program transformation changes the execution order between the zeroing statement and the reduction updates.
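Since the original snippets are not shown above, here is a minimal sketch in plain Python of the kind of program being discussed (buffer names, sizes, and the concrete computation are assumptions): zeroing at the top of the inner reduction loop, and the equivalent program after reorder(k, i).

```python
n, m = 4, 3
A = [[i * m + k for k in range(m)] for i in range(n)]

# Original loop order (i, k): the zeroing sits right above the k loop.
C = [None] * n
for i in range(n):
    C[i] = 0                      # zeroing position
    for k in range(m):
        C[i] = C[i] + A[i][k]     # reduction update

# Equivalent program after reorder(k, i): the zeroing can no longer stay at
# its original position; it must be guarded so that it still runs exactly
# once per output element, before any update touches that element.
C2 = [None] * n
for k in range(m):
    for i in range(n):
        if k == 0:
            C2[i] = 0             # zeroing now depends on the reduction var k
        C2[i] = C2[i] + A[i][k]

assert C == C2
```

The first form is unaffected by the loops above the zeroing; in the second, reordering moved the reduction loop outside the original zeroing position, so the zeroing has to be rewritten, which is exactly the dependency problem described above.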
Complete is not that complete
I've listed this code snippet above. We can see that even if block A is complete, we can't guarantee whether the "consumer" C consumes A's output or not.
Possible Solution: one-way fine-grained dataflow check
Suppose a loop tree has several blocks at its leaves. We can sort them in DFS order as B1, B2, ..., Bn. If
Then we are safe to decompose the whole loop tree.
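A rough sketch of what such a one-way check could look like, in plain Python. The concrete condition used here, that every read of an internally produced position must refer to a block earlier in the DFS order, is an assumption, since the exact condition is not quoted above.

```python
from dataclasses import dataclass

@dataclass
class LeafBlock:
    reads: set    # set of (buffer, position) pairs the block reads
    writes: set   # set of (buffer, position) pairs the block writes

def one_way_dataflow_ok(leaves):
    """leaves: the leaf blocks of the loop tree in DFS order (B1, ..., Bn)."""
    all_writes = set().union(*(b.writes for b in leaves))
    produced = set()
    for blk in leaves:
        # Positions read from buffers that some block in this tree produces
        # must already have been written by an earlier block.
        if not (blk.reads & all_writes) <= produced:
            return False
        produced |= blk.writes
    return True
```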
one-way fine-grained dataflow check
As said above, we want to check
This works when the block writes every position in its output region; a relaxed output region will lead to errors. For example,
Block 1 reads A and writes B; Block 2 reads B and writes C. When i = 0, Block 1 writes B[1,3,5,7,9] while Block 2 reads B[1,2,3,4,5], so i is not parallelizable. If a block doesn't write every position of its output region, e.g. A[0,0] is in the output region but never written, it is equivalent to the block reading A[0,0] and writing the same value back, which means the block reads its own output tensor.
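A sketch of the pattern described above, in plain Python; the index expressions and bounds are assumptions chosen only to reproduce the access sets listed (B[1,3,5,7,9] written vs. B[1..5] read at i = 0), not the original snippet.

```python
N = 20
A = list(range(N))
B = [0] * N
C = [0] * N

for i in range(2):
    # Block 1: writes a relaxed (strided) region of B.
    # i = 0 writes B[1, 3, 5, 7, 9]; i = 1 writes B[0, 2, 4, 6, 8].
    for j in range(5):
        B[2 * j + 1 - i] = A[2 * j + 1 - i]
    # Block 2: reads a dense region of B.
    # i = 0 reads B[1..5]; i = 1 reads B[6..10].
    for j in range(5):
        C[5 * i + j] = B[5 * i + j + 1]

# B[2] and B[4] are read by (Block 2, i = 0) but written by (Block 1, i = 1),
# so the two iterations of i are not independent: i is not parallelizable.
```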
After further discussion with @spectrometerHBH, we found that there are still some problems with the correctness validation.
Previously, we assumed that a block gets correct input as long as the dependency is satisfied (e.g. the consumer block must execute after the producer block). However, this is not always true, especially when the producer block updates its output tensor. Here is an example:
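The snippet itself is not reproduced above, so here is a minimal sketch in plain Python of the kind of pattern being described, with the block and loop names (A, B, i0, j0) taken from the surrounding text and the concrete computation assumed:

```python
n = 4
A = [0] * n
B = [0] * n

# Producer: block "A" is surrounded by loops i0 and j0, but its output position
# depends only on j0, so every A[j0] is updated n times; only after the last
# iteration of i0 does A[j0] hold its final value.
for i0 in range(n):
    for j0 in range(n):
        A[j0] = A[j0] + i0 + j0     # block A keeps updating its output

# Consumer: block "B" needs the final value of A[j0].
for j1 in range(n):
    B[j1] = A[j1] * 2               # block B

# compute_at(B, j0) would give:
#   for i0: for j0: { A[j0] = A[j0] + i0 + j0; B[j0] = A[j0] * 2 }
# Block B still runs after block A at every point, so the dependency looks
# satisfied, yet B would read a partial sum and the result would be wrong.
```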
In this case, we cannot compute_at block B under the loop j0, even though the dependency is satisfied; the code after compute_at would be wrong. The problem happens because the producer block has not yet produced its final, correct output while the loop is still running. The same problem also appears in reorder with branches. Of course, we are not trying to support such cases, but we need to ban them from the set of legal schedules, and that is not easy either.
Proposal 0
The most radical solution is to forbid scheduling of IR that comes from the hybrid script. It would solve every correctness problem, since we would then only support scheduling for te.compute. It is a crazy option and we will not use it as long as we have any other way; it would be our last resort.
Proposal 1
The key point is that the producer cannot be guaranteed to provide valid results. We would like to make sure all the producer blocks are complete. Here is my proposal:
- the requirement concerns the data_par block vars (not all the block_var);
- apply the check to compute_at;
- apply the check to reorder;
- introduce !+= to make the reduction (init + update) become a complete block.

With this, a complete block is equivalent to the current TVM stage, even for the reduction block, and we can forbid every risky operation that is not allowed under the completeness restriction.
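To make the restriction concrete, here is a rough sketch of the kind of check Proposal 1 implies, written against a hypothetical summary of a block; the field names and the exact completeness conditions below are assumptions for illustration, not the real TIR data structures.

```python
from dataclasses import dataclass
from typing import Dict, List, Set

@dataclass
class BlockSummary:
    iter_types: List[str]   # iteration type of each block var, e.g. "data_par"
    reads: Set[str]         # names of buffers the block reads
    writes: Set[str]        # names of buffers the block writes

def is_complete(block: BlockSummary, producers: Dict[str, int]) -> bool:
    """producers maps a buffer name to the number of blocks writing it.
    One possible formulation of completeness (assumed here):
    all block vars are data_par, the block is the sole producer of its
    outputs, and it does not read any buffer it writes."""
    all_data_par = all(t == "data_par" for t in block.iter_types)
    sole_producer = all(producers.get(buf, 0) <= 1 for buf in block.writes)
    no_self_read = not (block.reads & block.writes)
    return all_data_par and sole_producer and no_self_read

def check_compute_at(producer_blocks, producers):
    # Proposal 1: only allow compute_at (and reorder) when every producer
    # block involved is complete.
    return all(is_complete(b, producers) for b in producer_blocks)
```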
Proposal 2
An insight we can get from the counterexample above is that we have not yet come up with a validation mechanism for a block and its surrounding loops. Actually, block A's surrounding loops i0 and j0 are not legal, for the reasons stated below.
The spirit of TIR is to express information in the block and to generate loops outside the block that satisfy its constraints. Besides the range information, we also have to check the binding function, i.e. (vi, vj, ...) = f(i, j, ...). The range only tells us which subset of the block's instance space may be run by the outside loops; it cannot tell us how many times a specific running instance will be touched (maybe 0, maybe larger than 1). In the counterexample above, each instance of block "A" will be run many times because of i0. Hence, if we pose no further constraints, the block info is not complete.

One reasonable constraint on the binding function f is that it builds a 1-1 correspondence between the space of (i, j, k, ...) and the space of (vi, vj, vk, ...), i.e. the loop iteration together with the binding function iterates over every instance in the block-var space exactly once. The previous complete-block definition plus this 1-1 correspondence constraint will make sure it is safe to compute_at.

But this constraint is difficult to check: split and fuse make the binding function rather complicated, and it is hard to find a simple pattern that covers the bindings we currently have to support while still being provably a 1-1 correspondence. One algorithm I have come up with only accepts the following bindings as legal:
given a binding, we try to reverse the split and fuse effects to get vi = c1*i + b1, vj = c2*j + b2, vk = c3*k + b3; if we fail, the binding is not legal.
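For concreteness, here is a very small sketch of the reversal step in plain Python. The toy expression encoding, and the restriction to a single level of split/fuse, are assumptions made for illustration; the real bindings are TIR expressions, and the real check would also have to verify extents and divisibility.

```python
# Toy binding expressions:
#   ("var", name)                               a loop variable
#   ("const", value)                            an integer constant
#   ("add", a, b), ("mul", a, b)                arithmetic
#   ("floordiv", a, b), ("floormod", a, b)      produced by fuse

def as_affine(e):
    """Return (loop_var, c, b) if e has the shape c * var + b, else None."""
    if e[0] == "var":
        return (e[1], 1, 0)
    if e[0] == "mul" and e[1][0] == "var" and e[2][0] == "const":
        return (e[1][1], e[2][1], 0)
    if e[0] == "add" and e[2][0] == "const":
        inner = as_affine(e[1])
        return None if inner is None else (inner[0], inner[1], inner[2] + e[2][1])
    return None

def undo_split(e):
    """A split shows up as outer * factor + inner; reverse it to the pair
    of loop vars it came from."""
    if (e[0] == "add" and e[1][0] == "mul" and e[1][1][0] == "var"
            and e[1][2][0] == "const" and e[2][0] == "var"):
        return (e[1][1][1], e[2][1])
    return None

def undo_fuse(e_outer, e_inner):
    """A fuse shows up as the pair (fused // c, fused % c) across two block
    vars; reverse it to the single fused loop var."""
    if (e_outer[0] == "floordiv" and e_inner[0] == "floormod"
            and e_outer[1] == e_inner[1] and e_outer[2] == e_inner[2]
            and e_outer[1][0] == "var"):
        return e_outer[1][1]
    return None
```

A binding would be accepted only if, after repeatedly applying such reversals, every block var reduces to the affine form above over a distinct loop variable.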
Between the two proposals (excluding Proposal 0), we both prefer Proposal 1 for now.
@tqchen