-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[TIR] Utility function to decide loop mapping for auto tensorization #11050
Conversation
next_block_ind = i_block - 1; | ||
break; | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic here is very different from the one in the original code https://github.com/spectrometerHBH/tvm/blob/auto-tensorization/src/tir/schedule/analysis/analysis.cc#L1246. I was not able to understand why the original code has been written that way and it didn't work for the case where matching loops in the target block are not in the innermost positions (conv2d NCHWc on CPU, a test in
def test_get_tensorize_loop_mapping_conv2d_nchwc_vnni(): |
I think my change is simple and obvious. The condition for a match is (1) divisibility of loop extent and (2) matching iterator types (reduction vs spatial). Mapping is determined starting from the innermost axis.
Please have a look at this change carefully, and let me know if I need to bring back some logic in the original code @spectrometerHBH @vinx13
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love to have @spectrometerHBH review this change before merging
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The goal of the original mapping is to support
for k:
for i:
for j:
C[i, j] += A[i, k] * B[k, j]
where loops are not in the same order as the tensor intrinsic description function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But it also makes sense if we don't support such cases for this PR. So I approve it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @spectrometerHBH, I now understand the original code and was able to integrate the original logic to support loop permutations. Please have a look at the current diff, also cc @vinx13 @Hzfengsy @MasterJH5574
The key difference between the original code and the code I submitted yesterday was that, my code was looking at only the loop nest (ForNode
) to determine the mapping, while @spectrometerHBH's mapping logic is based on iter_var/value
of the block (so invariant to the order of the loop nest).
d6ae848
to
1ff1df9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks Masa! I just caught a minor point.
This reverts commit eb147f3.
Co-authored-by: Ruihang Lai <lairuihangdongdong@qq.com>
e504536
to
94391b1
Compare
94391b1
to
9ec0974
Compare
212d5dc
to
2909a06
Compare
ICHECK(desc_loops.size() == static_cast<size_t>(n_desc_vars)); | ||
ICHECK(block_loops.size() == iter_types_block.size()); | ||
|
||
// We assume that the orders of iter_vars in the target and the desc block are consistent. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i.e., no matter what the permutation of loop is, we should always have
i, j, k = T.axis.remap("SSR", [i0, i1, i2])
for GEMM.
I think this is a reasonable assumption. Correct me if I'm wrong @spectrometerHBH @junrushao1994 @vinx13
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree this is a reasonable assumption. Though there might be corner cases, it covers all of the current use cases
Add
TensorizeInfo
structure andGetTensorizeLoopMapping
function, that are used for determining the correspondence of loops between a target block and an intrinsic description.Matching is based on a heuristic: It works in all cases I tested (CPU dot product for dense / conv2d, CPU / GPU matmul), but there is no guarantee that it always finds the "right" mapping. If the mapping is not correct, tensorize would fail.
The original code is https://github.com/spectrometerHBH/tvm/blob/auto-tensorization/src/tir/schedule/analysis/analysis.cc#L1175, I modified this code to support more cases and added tests. I'm sending this PR on behalf of the team, but most of the work were done by others earlier.
Co-authored-by: Siyuan Feng Hzfengsy@sjtu.edu.cn
Co-authored-by: Bohan Hou 32121147+spectrometerHBH@users.noreply.github.com
Co-authored-by: Hongyi Jin 3231950289@qq.com
Co-authored-by: Ruihang Lai lairuihangdongdong@qq.com
Co-authored-by: Wuwei Lin wuwei@apache.org
@vinx13 @junrushao1994 @spectrometerHBH @Hzfengsy @MasterJH5574 @jinhongyii