
[Relay] Introduce arguments limit to FuseOps pass #15137

Merged

Conversation

echuraev
Contributor

@echuraev echuraev commented Jun 21, 2023

In PR #8313, a parameter max_function_args was introduced. It limits the
number of function arguments, and when this value is exceeded, the
concatenation layer is split into several concat operations.

I faced a problem on Adreno GPU where, for a kernel with a large number of
arguments, enqueueNDRange crashed without any error. The problem
appeared because of the huge number of arguments. But in this case the
concat layer was not the only root cause of the problem: after fusing
several operations, the final functions also had a large number of
arguments.

As discussed in #8313, adding a limit on the number of function
arguments to the FuseOps pass might be a good improvement. In this PR I
introduce such a mechanism for limiting the number of function arguments
in the FuseOps pass and set the argument limit for OpenCL devices to
128 parameters.

The idea of the current approach is to calculate the number of arguments for
each node in the fusing algorithm; when the number of function arguments
exceeds the limit specified by max_function_args, fusing is stopped. When a
node has several inputs and the number of arguments has not yet been computed
for some of them, we postpone fusing for this node and try to fuse it later,
once the number of arguments has been computed for all inputs. This
postponed fusing helps avoid additional computations during compilation.
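The decision logic described above can be sketched in a few lines of Python. This is a minimal, hypothetical model (the actual implementation lives in the C++ graph partitioner); the node representation, the `try_fuse` helper, and its return values are illustrative assumptions, not the real API.

```python
# Toy model of the argument-limit check made while fusing.
# A node is fused only when the argument counts of all of its inputs
# are known and the combined count stays within max_function_args.

def try_fuse(node, input_arg_counts, max_function_args=128):
    """Return 'postpone', 'stop', or 'fuse' for a candidate node.

    `input_arg_counts` maps input names to their computed argument
    counts, with None meaning "not computed yet".
    """
    counts = [input_arg_counts.get(inp) for inp in node["inputs"]]
    if any(c is None for c in counts):
        # Some inputs are not processed yet: revisit this node later
        # instead of recomputing argument counts now.
        return "postpone"
    total = sum(counts) + node["own_args"]
    if total > max_function_args:
        return "stop"  # fusing here would exceed the limit
    return "fuse"

node = {"inputs": ["a", "b"], "own_args": 1}
print(try_fuse(node, {"a": 2, "b": None}))  # postpone
print(try_fuse(node, {"a": 2, "b": 3}))     # fuse
```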

Additionally, the case of dynamic shapes must be handled. In this case, the
function arguments also include the sizes of the dynamic dimensions and the
strides. The number of strides can be computed from the number of tensor
dimensions (the number of strides equals the rank of the tensor), and the
number of additional size parameters equals the number of dynamic dimensions.
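This counting rule can be illustrated with a small sketch. `count_args` is a hypothetical helper, assuming each dynamically shaped tensor contributes one buffer pointer, one stride per dimension, and one size argument per dynamic dimension:

```python
# Count the kernel arguments contributed by tensors with possibly
# dynamic shapes. `None` marks a dynamic dimension.

def count_args(shapes):
    total = 0
    for shape in shapes:
        rank = len(shape)                                 # one stride per dim
        dynamic = sum(1 for dim in shape if dim is None)  # dynamic sizes
        total += 1 + rank + dynamic                       # 1 buffer pointer
    return total

# One 2-D tensor with a dynamic batch dimension:
# 1 pointer + 2 strides + 1 dynamic size = 4 arguments.
print(count_args([[None, 32]]))  # 4
```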

cc: @Hzfengsy, @masahi, @csullivan

@tvm-bot
Collaborator

tvm-bot commented Jun 21, 2023

Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment.

Generated by tvm-bot

@echuraev echuraev marked this pull request as draft June 22, 2023 11:29
@echuraev echuraev marked this pull request as ready for review July 6, 2023 15:00
@echuraev
Contributor Author

echuraev commented Jul 6, 2023

@csullivan, @masahi, @Hzfengsy now all functional and performance issues are fixed. Could you please review this PR?

@Hzfengsy
Member

@masahi Could you please review this one?

@masahi masahi self-assigned this Jul 10, 2023
@echuraev echuraev force-pushed the echuraev/add_arguments_limitation_to_fuse_ops branch from 5f6cc38 to 08a4494 Compare July 18, 2023 08:43
@echuraev
Contributor Author

Hi @masahi!

Thank you for your review, and sorry for the delay in my response. I found an accuracy problem in the previous implementation, and it was necessary to reimplement the algorithm almost from scratch. I have added a few new tests and also checked the accuracy on several networks. Now everything is OK. I have also addressed some of your comments in the new commit.

Could you please take a look at this PR once again?

Contributor

@csullivan csullivan left a comment


Thanks @echuraev!

// case when the number of kernel arguments was pretty big. OpenCL doesn't
// specify any limitation on the number of kernel arguments. max_function_args
// set to 128 looks like a reasonable limit on the number of kernel arguments.
.add_attr_option<Integer>("max_function_args", Integer(128))

We discussed offline considering making this the case only for Adreno instead of any OpenCL target. The issue is that there appears to be a limitation in FuseOps, exposed by this PR, where the remaining ops (mod max_function_args) that were not fused do not fuse together and instead remain individual PrimFuncs. @echuraev will follow up with an issue to track this exposed limitation of FuseOps.

@echuraev
Contributor Author

@csullivan In bug #15358 I have described a problem related to the FuseOps pass algorithm. The same problem applies to the functionality introduced in this PR. When we stop fusing because the number of function arguments is exceeded, the first PrimFunc contains the maximum possible number of base functions (adding one more base function would exceed max_function_args), but the other PrimFuncs contain only one base function each.

I suppose this is a generic problem in the current fusion algorithm, and it should be fixed separately in another PR.

Member

@masahi masahi left a comment


There are too many obvious grammar errors in the comments. Please go through them.

@echuraev
Contributor Author

@masahi Thank you for your review! I have corrected all the mistakes that I found. Please let me know if I missed something.

@echuraev echuraev force-pushed the echuraev/add_arguments_limitation_to_fuse_ops branch from 912eaac to 58f17b7 Compare July 20, 2023 07:06
@echuraev
Contributor Author

@masahi thank you for your review and help! I have addressed all the comments.

@echuraev
Contributor Author

@tvm-bot rerun

@masahi masahi merged commit 7ebc802 into apache:main Jul 21, 2023
MasterJH5574 added a commit to mlc-ai/relax that referenced this pull request Aug 1, 2023
tqchen pushed a commit that referenced this pull request Aug 1, 2023
Fix FuseOps to adapt #15137
Fix TIR TVMScript to adapt #15214