[TIR] Utility function to decide loop mapping for auto tensorization #11050

Merged
merged 19 commits into apache:main on Apr 20, 2022

Conversation

masahi
Member

@masahi masahi commented Apr 18, 2022

Add the TensorizeInfo structure and the GetTensorizeLoopMapping function, which are used to determine the correspondence of loops between a target block and an intrinsic description.

Matching is based on a heuristic: it works in all cases I tested (CPU dot product for dense / conv2d, CPU / GPU matmul), but there is no guarantee that it always finds the "right" mapping. If the mapping is not correct, tensorize will fail.
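
A minimal usage sketch through the Python API (hedged: the exact signature of get_tensorize_loop_mapping and the fields of TensorizeInfo are assumptions based on this PR's tests; the workload and the 16x16x16 description below are purely illustrative, not a real registered intrinsic):

import tvm
from tvm.script import tir as T
from tvm.tir.schedule.analysis import get_tensorize_loop_mapping

@T.prim_func
def matmul(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (128, 128), "float32")
    B = T.match_buffer(b, (128, 128), "float32")
    C = T.match_buffer(c, (128, 128), "float32")
    for i, j, k in T.grid(128, 128, 128):
        with T.block("C"):
            vi, vj, vk = T.axis.remap("SSR", [i, j, k])
            with T.init():
                C[vi, vj] = T.float32(0)
            C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]

# Illustrative 16x16x16 matmul intrinsic description.
@T.prim_func
def desc_16x16x16(a: T.handle, b: T.handle, c: T.handle) -> None:
    A = T.match_buffer(a, (16, 16), "float32", offset_factor=1)
    B = T.match_buffer(b, (16, 16), "float32", offset_factor=1)
    C = T.match_buffer(c, (16, 16), "float32", offset_factor=1)
    with T.block("root"):
        T.reads(C[0:16, 0:16], A[0:16, 0:16], B[0:16, 0:16])
        T.writes(C[0:16, 0:16])
        for i, j, k in T.grid(16, 16, 16):
            with T.block("update"):
                vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]

sch = tvm.tir.Schedule(matmul)
info = get_tensorize_loop_mapping(sch, sch.get_block("C"), desc_16x16x16)
# Returns None if no mapping is found; otherwise the TensorizeInfo pairs each
# loop of the description with the corresponding loop of the target block.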

The original code is https://github.com/spectrometerHBH/tvm/blob/auto-tensorization/src/tir/schedule/analysis/analysis.cc#L1175; I modified it to support more cases and added tests. I'm sending this PR on behalf of the team, but most of the work was done by others earlier.

Co-authored-by: Siyuan Feng Hzfengsy@sjtu.edu.cn
Co-authored-by: Bohan Hou 32121147+spectrometerHBH@users.noreply.github.com
Co-authored-by: Hongyi Jin 3231950289@qq.com
Co-authored-by: Ruihang Lai lairuihangdongdong@qq.com
Co-authored-by: Wuwei Lin wuwei@apache.org

@vinx13 @junrushao1994 @spectrometerHBH @Hzfengsy @MasterJH5574 @jinhongyii

next_block_ind = i_block - 1;
break;
}
}
Member Author

The logic here is very different from the one in the original code https://github.com/spectrometerHBH/tvm/blob/auto-tensorization/src/tir/schedule/analysis/analysis.cc#L1246. I was not able to understand why the original code was written that way, and it didn't work for the case where the matching loops in the target block are not in the innermost positions (conv2d NCHWc on CPU, exercised by the test test_get_tensorize_loop_mapping_conv2d_nchwc_vnni).

I think my change is simple and obvious. The condition for a match is (1) divisibility of the loop extent and (2) matching iterator types (reduction vs. spatial). The mapping is determined starting from the innermost axis, as sketched below.
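
In Python-flavored pseudocode, the heuristic amounts to something like this (a simplified sketch of the matching rule described above, not the actual implementation in analysis.cc; the loop representation is hypothetical):

# Simplified sketch: pair loops from the innermost axis outward. A desc loop
# matches a block loop when the iterator types agree (spatial vs. reduction)
# and the block loop extent is divisible by the desc loop extent.
def match_innermost(block_loops, desc_loops):
    # Each loop is an (extent, iter_type) pair, listed outermost-first.
    mapping = {}
    i_block = len(block_loops)
    for i_desc in reversed(range(len(desc_loops))):
        desc_extent, desc_type = desc_loops[i_desc]
        found = False
        while i_block > 0:
            i_block -= 1
            block_extent, block_type = block_loops[i_block]
            if block_type == desc_type and block_extent % desc_extent == 0:
                mapping[i_block] = i_desc  # block loop i_block hosts desc loop i_desc
                found = True
                break
        if not found:
            return None  # no compatible block loop left for this desc loop
    return mapping

# E.g. a 1024x1024x1024 matmul block against a 16x16x16 description:
# match_innermost([(1024, "S"), (1024, "S"), (1024, "R")],
#                 [(16, "S"), (16, "S"), (16, "R")])  ->  {2: 2, 1: 1, 0: 0}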

Please have a careful look at this change, and let me know if I need to bring back some of the logic from the original code @spectrometerHBH @vinx13

Member

I would love to have @spectrometerHBH review this change before merging

Contributor

The goal of the original mapping is to support

for k:
  for i:
    for j:
        C[i, j] += A[i, k] * B[k, j]

where the loops are not in the same order as in the tensor intrinsic description function.
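
For contrast, the description function declares the same computation in the canonical loop order, e.g.:

for i:
  for j:
    for k:
        C[i, j] += A[i, k] * B[k, j]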

Contributor

@spectrometerHBH spectrometerHBH Apr 19, 2022

But it also makes sense not to support such cases in this PR. So I approve it.

Member Author

Thanks @spectrometerHBH, I now understand the original code and was able to integrate the original logic to support loop permutations. Please have a look at the current diff, also cc @vinx13 @Hzfengsy @MasterJH5574

The key difference between the original code and the code I submitted yesterday is that my code looked only at the loop nest (ForNode) to determine the mapping, while @spectrometerHBH's mapping logic is based on the iter_var/value bindings of the block (and is therefore invariant to the order of the loop nest).
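
Concretely (an illustrative TVMScript fragment mirroring the GEMM example above): even with the permuted k/i/j loop nest, the block still records which loop variable binds to each block axis, so matching on the bindings is unaffected by loop order.

for k, i, j in T.grid(128, 128, 128):
    with T.block("C"):
        # The bindings vi <- i, vj <- j, vk <- k live on the block itself,
        # independent of where each loop sits in the nest.
        vi, vj, vk = T.axis.remap("SSR", [i, j, k])
        C[vi, vj] = C[vi, vj] + A[vi, vk] * B[vk, vj]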

@masahi masahi force-pushed the tir-tensorize-loop-mapping branch from d6ae848 to 1ff1df9 on April 19, 2022 at 00:37
Contributor

@MasterJH5574 MasterJH5574 left a comment

Thanks Masa! I just spotted one minor point.

Review comment on python/tvm/tir/schedule/analysis.py (outdated, resolved)
@masahi masahi force-pushed the tir-tensorize-loop-mapping branch from e504536 to 94391b1 on April 19, 2022 at 20:59
@masahi masahi force-pushed the tir-tensorize-loop-mapping branch from 94391b1 to 9ec0974 on April 19, 2022 at 21:04
@masahi masahi force-pushed the tir-tensorize-loop-mapping branch from 212d5dc to 2909a06 on April 19, 2022 at 21:39
ICHECK(desc_loops.size() == static_cast<size_t>(n_desc_vars));
ICHECK(block_loops.size() == iter_types_block.size());

// We assume that the orders of iter_vars in the target and the desc block are consistent.
Member Author

@masahi masahi Apr 19, 2022

i.e., no matter what the permutation of the loops is, we should always have

i, j, k = T.axis.remap("SSR", [i0, i1, i2])

for GEMM.

I think this is a reasonable assumption. Correct me if I'm wrong @spectrometerHBH @junrushao1994 @vinx13
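
(For reference, a hypothetical block that declared its iter_vars in a different order than the description, e.g.

k, i, j = T.axis.remap("RSS", [i2, i0, i1])

would violate this assumption.)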

Member

I agree this is a reasonable assumption. Though there might be corner cases, it covers all of the current use cases.

@vinx13 vinx13 merged commit 3823b39 into apache:main Apr 20, 2022