
[Relay] Introduce arguments limit to FuseOps pass #15137

Merged

Conversation

echuraev
Contributor

@echuraev echuraev commented Jun 21, 2023

In PR #8313, a parameter max_function_args was introduced. It limits the
number of function arguments, and when this value is exceeded, the
concatenation layer is split into several concat operations.

I faced a problem on Adreno GPU where, for a kernel with a large number of
arguments, enqueueNDRange crashed without any error. The problem
appeared because of the huge number of arguments. But in this case the
concat layer was not the only root cause of the problem: after fusing
several operations, the final functions also had a large number of
arguments.

As discussed in #8313, adding a limit on the number of function
arguments to the FuseOps pass might be a good improvement. In this PR I
introduce such a mechanism for limiting the number of function arguments
in the FuseOps pass and set the argument limit for OpenCL devices to
128 parameters.

The idea of the current approach is to calculate the number of arguments for
each node in the fusing algorithm; when the number of function arguments
exceeds the limit specified by max_function_args, fusing is stopped. When a
node has several inputs and the number of arguments has not yet been computed
for some of them, we postpone fusing for this node and try to fuse it later,
once the number of arguments has been computed for all inputs. This
postponed fusing helps avoid additional computations during compilation.
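The decision logic described above can be sketched in a few lines of Python. This is a minimal, hypothetical model (the actual implementation lives in the C++ graph partitioner); the node representation, the `try_fuse` helper, and its return values are illustrative assumptions, not the real API.

```python
# Toy model of the argument-limit check made while fusing.
# A node is fused only when the argument counts of all of its inputs
# are known and the combined count stays within max_function_args.

def try_fuse(node, input_arg_counts, max_function_args=128):
    """Return 'postpone', 'stop', or 'fuse' for a candidate node.

    `input_arg_counts` maps input names to their computed argument
    counts, with None meaning "not computed yet".
    """
    counts = [input_arg_counts.get(inp) for inp in node["inputs"]]
    if any(c is None for c in counts):
        # Some inputs are not processed yet: revisit this node later
        # instead of recomputing argument counts now.
        return "postpone"
    total = sum(counts) + node["own_args"]
    if total > max_function_args:
        return "stop"  # fusing here would exceed the limit
    return "fuse"

node = {"inputs": ["a", "b"], "own_args": 1}
print(try_fuse(node, {"a": 2, "b": None}))  # postpone
print(try_fuse(node, {"a": 2, "b": 3}))     # fuse
```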

Additionally, the case of dynamic shapes must be handled. In this case, the
function arguments also include the sizes of the dynamic dimensions and the
strides. The number of strides can be computed from the number of tensor
dimensions (the number of strides equals the rank of the tensor), and the
number of additional size parameters equals the number of dynamic dimensions.
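This counting rule can be illustrated with a small sketch. `count_args` is a hypothetical helper, assuming each dynamically shaped tensor contributes one buffer pointer, one stride per dimension, and one size argument per dynamic dimension:

```python
# Count the kernel arguments contributed by tensors with possibly
# dynamic shapes. `None` marks a dynamic dimension.

def count_args(shapes):
    total = 0
    for shape in shapes:
        rank = len(shape)                                 # one stride per dim
        dynamic = sum(1 for dim in shape if dim is None)  # dynamic sizes
        total += 1 + rank + dynamic                       # 1 buffer pointer
    return total

# One 2-D tensor with a dynamic batch dimension:
# 1 pointer + 2 strides + 1 dynamic size = 4 arguments.
print(count_args([[None, 32]]))  # 4
```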

cc: @Hzfengsy, @masahi, @csullivan

@tvm-bot
Collaborator

tvm-bot commented Jun 21, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@echuraev echuraev marked this pull request as draft June 22, 2023 11:29
@echuraev echuraev marked this pull request as ready for review July 6, 2023 15:00
@echuraev
Contributor Author

echuraev commented Jul 6, 2023

@csullivan, @masahi, @Hzfengsy now all functional and performance issues are fixed. Could you please review this PR?

@Hzfengsy
Member

@masahi Could you please review this one?

@masahi masahi self-assigned this Jul 10, 2023
@echuraev echuraev force-pushed the echuraev/add_arguments_limitation_to_fuse_ops branch from 5f6cc38 to 08a4494 Compare July 18, 2023 08:43
@echuraev
Contributor Author

Hi @masahi!

Thank you for your review, and sorry for the delay in my response. I found an accuracy problem in the previous implementation, and it was necessary to reimplement the algorithm almost from scratch. I have added a few new tests and also checked the accuracy on several networks. Now everything is OK. I have also addressed some of your comments in the new commit.

Could you please take a look at this PR once again?

Contributor

@csullivan csullivan left a comment


Thanks @echuraev!

// case when the number of kernel arguments was pretty big. OpenCL doesn't
// specify any limitation on the number of kernel arguments. max_function_args
// set to 128 looks like a reasonable limit on the number of kernel arguments.
.add_attr_option<Integer>("max_function_args", Integer(128))

We discussed offline considering making this the case only for Adreno instead of any OpenCL target. The issue is that there appears to be a limitation in FuseOps, exposed by this PR, where the remaining ops (mod max_function_args) that were not fused do not fuse together and instead remain individual PrimFuncs. @echuraev will follow up with an issue to track this exposed limitation of FuseOps.

@echuraev
Contributor Author

@csullivan In bug #15358 I have described a problem related to the FuseOps pass algorithm. The same problem applies to the functionality introduced in this PR. When we stop fusing because the number of function arguments is exceeded, the first PrimFunc contains the maximum possible number of base functions (adding one more base function would exceed max_function_args), but the other PrimFuncs contain only one base function each.

I suppose this is a generic problem in the current fusion algorithm, and it should be fixed separately in another PR.

Member

@masahi masahi left a comment


There are too many obvious grammar errors in the comments. Please go through them.

@echuraev
Contributor Author

@masahi Thank you for your review! I have corrected all the mistakes that I found. Please let me know if I missed something.

@echuraev echuraev force-pushed the echuraev/add_arguments_limitation_to_fuse_ops branch from 912eaac to 58f17b7 Compare July 20, 2023 07:06
@echuraev
Contributor Author

@masahi thank you for your review and help! I have addressed all the comments.

@echuraev
Contributor Author

@tvm-bot rerun

@masahi masahi merged commit 7ebc802 into apache:main Jul 21, 2023
MasterJH5574 added a commit to mlc-ai/relax that referenced this pull request Aug 1, 2023
tqchen pushed a commit that referenced this pull request Aug 1, 2023
Fix FuseOps to adapt #15137
Fix TIR TVMScript to adapt #15214