-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Relay] Introduce arguments limit to FuseOps pass #15137
[Relay] Introduce arguments limit to FuseOps pass #15137
Conversation
Thanks for contributing to TVM! Please refer to the contributing guidelines https://tvm.apache.org/docs/contribute/ for useful information and tips. Please request code reviews from Reviewers by @-ing them in a comment. Generated by tvm-bot |
@csullivan, @masahi, @Hzfengsy now all functional and performance issues are fixed. Could you please review this PR? |
@masahi Could you please review this one? |
In PR apache#8313 a parameter `max_function_args` was introduced. It leads to limit number of function argument and in case when this value is exceeded then concatenation layer is split to a several concat operations. I faced a problem on Adreno GPU that for kernel with big number of arguments the enqueueNDRange was crashed without any errors. The problem appeared because of the huge number of arguments. But in this case not only concat layer was a root cause of the problem. Also after fusing several operations the final functions had a big number of arguments. As it was discussed in apache#8313, adding a limitation on the number of function arguments to the FuseOps pass might be a good improvement. In this PR I introduced such mechanism for limitation number of function arguments for FuseOps pass and add an arguments limit to OpenCL devices at 128 parameters. The idea of current approach is calculate the number of arguments for each node in fusing algorithm and in case then the number of function arguments exceeds the limit, specified by `max_function_args`, then the fusing should be stopped. In case when node has several inputs and for some of the inputs the number of arguments wasn't computed, then we postpone fusing for this node and will try fuse this node later when the number of arguments will be computed for all inputs. This approach with postponed fusing helps to avoid additional computations during compilation. Additionally, case of dynamic shapes should be handled. In case of dynamic shape, function arguments also included sizes of dynamic dimension and strides. The number of strides can be computed by calculating number of tensor dimensions (the number of strides equals to the rank of the tensor). The number of additional parameters with sizes of dynamic dimensions can be calculated by computing number of dynamic dimensions.
5f6cc38
to
08a4494
Compare
Hi @masahi! Thank you for your review, and sorry for the delay in the response. I found one accuracy problem in previous implementation and it was necessary to reimplement this algorithm almost from scratch. I have added a few new tests and also checked the accuracy on several networks. Now everything is ok. Also, I have applied some of your comments in the new commit. Could you please take a look at this PR once again? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @echuraev!
// case when the number of kernel arguments was pretty big. OpenCL doesn't | ||
// specify any limitations on the number of kernel arguments. max_function_args | ||
// equals to 128 looks like a reasonable number of kernel arguments. | ||
.add_attr_option<Integer>("max_function_args", Integer(128)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed offline considering making this only the case for Adreno instead of any opencl. The issue is there appears to be a limitation in FuseOps exposed by this PR where the remainder of ops (mod max_function_args) not fused do not fuse together and remain individual primfuncs. @echuraev will follow up with an issue to track this exposed limitation of FuseOps.
@csullivan In the bug #15358 I have described a problem related to the FuseOps pass algorithm. The same problem applicable for the functionality which was introduced in this PR. When we stop fusing because the number of the function arguments was exceeded, then the first PrimFunc will contain the maximum possible number of base functions (if one more base function is added, then the I suppose it is a generic problem in current fusion algorithm and it should be fixed separately in another PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are too many obvious grammar errors in the comments. Please go through them.
@masahi Thank you for your review! I have corrected all mistakes that I have found. Please, let me know if I missed something. |
912eaac
to
58f17b7
Compare
@masahi thank you for your review and help! Applied all comments. |
@tvm-bot rerun |
In PR #8313 a parameter
max_function_args
was introduced. It leads tolimit the number of function arguments and in the case when this value is
exceeded, then the concatenation layer is split into a several concat
operations.
I faced a problem on Adreno GPU that for kernel with a big number of
arguments the enqueueNDRange was crashed without any errors. The
problem appeared because of the huge number of arguments. But in this
case not only concat layer was a root cause of the problem. Also after
fusing several operations the final functions had a big number of
arguments.
As it was discussed in #8313, adding a limitation on the number of
function arguments to the FuseOps pass might be a good improvement. In
this PR I introduced such a mechanism for limiting the number of function
arguments for FuseOps pass and add an arguments limit to OpenCL devices
at 128 parameters.
The idea of the current approach is to calculate the number of arguments for
each node in the fusing algorithm and in case then the number of function
arguments exceeds the limit, specified by
max_function_args
, then thefusing should be stopped. In case when a node has several inputs and for
some of the inputs the number of arguments wasn't computed, then we
postpone fusing for this node and will try to fuse this node later when
the number of arguments will be computed for all inputs. This approach
with postponed fusing helps to avoid additional computations during
compilation.
Additionally, the case of dynamic shapes should be handled. In this case,
function arguments also included sizes of dynamic dimension and strides.
The number of strides can be computed by calculating the number of tensor
dimensions (the number of strides equals the rank of the tensor). The number
of additional parameters with sizes of dynamic dimensions can be calculated
by computing the number of dynamic dimensions.
cc: @Hzfengsy, @masahi, @csullivan