[RELAY][PASS] CombineParallelConv2D #2089
Conversation
@vinx13 can you also elaborate a bit on possible use cases of this pass?
I guess a single bigger kernel is better for performance than several small ones.
TensorRT does this optimization: https://devblogs.nvidia.com/deploying-deep-learning-nvidia-tensorrt/#attachment_6827
@vinx13 What happens if the convolutions are followed by elemwise ops (bias, batch norm, etc., which should be fused into a single op)? The elemwise ops that follow a convolution can be different for each child convolution branch. Can the folded convolution still be fused with elemwise ops?
After this pass, the original convolutions will be replaced with strided_slice ops over the output of the combined convolution.
hmm interesting. I understand that we can at least fuse elemwise ops into each strided_slice. The code itself looks good though; I understood how it works.
So we should also fold the following bias, relu and bn to keep the final graph easy for fusion. These elementwise ops should be placed before strided_slice.
Yeah, but elemwise ops that follow a convolution can be different for each child convolution branch (I don't know about the Inception network). So I think there will be less chance of folding? Since there are multiple convolution branches, not everything can be fused anyway. If we fold elemwise ops as well, we are left with multiple … I think we might as well stop fusing (or "Realize", in NNVM terms) at the folded convolution op, and let …
I am working on a new fusion pass in the new Relay IR, and hopefully we can follow up on this topic there.
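To make the folding being discussed concrete, here is a hand-built Relay sketch (shapes and variable names are made up for illustration; this is not produced by the pass itself): per-branch bias_add and relu are themselves combined along the output-channel axis, so strided_slice stays the last op of each branch and remains easy to fuse.

```python
from tvm import relay

# Two parallel branches, conv2d + bias_add + relu, sharing input x (NCHW).
x  = relay.var("x",  shape=(1, 16, 32, 32))
w1 = relay.var("w1", shape=(32, 16, 3, 3))
w2 = relay.var("w2", shape=(64, 16, 3, 3))
b1 = relay.var("b1", shape=(32,))
b2 = relay.var("b2", shape=(64,))

# Combined form: concatenate weights and biases along the output-channel
# axis, apply conv2d / bias_add / relu once, then slice per branch.
w = relay.concatenate([w1, w2], axis=0)   # (96, 16, 3, 3)
b = relay.concatenate([b1, b2], axis=0)   # (96,)
y = relay.nn.relu(relay.nn.bias_add(relay.nn.conv2d(x, w, padding=(1, 1)), b))
out1 = relay.strided_slice(y, begin=[0, 0, 0, 0],  end=[1, 32, 32, 32])
out2 = relay.strided_slice(y, begin=[0, 32, 0, 0], end=[1, 96, 32, 32])
```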
Some specific followup comments:
src/relay/pass/pattern_util.h (Outdated)
@@ -135,6 +150,20 @@ inline Constant MakeConstantScalar(DataType dtype, T value) {
   return ConstantNode::make(arr);
 }

 template<typename T, typename = typename std::enable_if<std::is_integral<T>::value>::type>
Let us use slice instead of take to get the final result
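A Relay-level illustration of the slice-vs-take suggestion (the pass itself builds these ops in C++; the tensor y and its shape are assumptions): both expressions pick output channels [0, 32) of a combined NCHW result, but strided_slice expresses a contiguous range directly instead of gathering through an index tensor.

```python
import numpy as np
from tvm import relay

y = relay.var("y", shape=(1, 96, 32, 32))  # hypothetical combined conv2d output

# take: gather the wanted channels through an explicit index tensor
via_take = relay.take(y, relay.const(np.arange(32, dtype="int32")), axis=1)

# strided_slice: a contiguous slice along the channel axis
via_slice = relay.strided_slice(y, begin=[0, 0, 0, 0], end=[1, 32, 32, 32])
```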
The new set of changes LGTM. @vinx13 can you also add support for fusing followup ops and a test case?
strided_slice is added in #2094
@MarisaKirisame @merrymercy can you please take another look and approve or request changes explicitly, per https://docs.tvm.ai/contribute/code_review.html#approve-and-request-changes-explicitly
@vinx13 please rebase against the master
Thanks @MarisaKirisame @masahi @vinx13, this is merged!
This pass replaces convolutions that share the same input node and the same arguments (except that the number of output channels can be different) with a single convolution. The weight of the new conv2d is the concatenation of the original weights. The original conv2d nodes are replaced with strided_slice ops that take slices of the output of the new conv2d. This avoids launching multiple kernels in networks with parallel convolution branches, such as Inception blocks.
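A hand-built Relay sketch of the rewrite (shapes, layout, and variable names are assumptions for illustration; the pass performs this rewrite on the graph automatically):

```python
from tvm import relay

# Before: two conv2d branches that share input x and all attributes,
# differing only in the number of output channels.
x  = relay.var("x",  shape=(1, 16, 32, 32))
w1 = relay.var("w1", shape=(32, 16, 3, 3))
w2 = relay.var("w2", shape=(64, 16, 3, 3))
branch1 = relay.nn.conv2d(x, w1, padding=(1, 1))
branch2 = relay.nn.conv2d(x, w2, padding=(1, 1))

# After: one conv2d over the concatenated weight, then strided_slice on the
# channel axis recovers each branch's output.
w = relay.concatenate([w1, w2], axis=0)            # (96, 16, 3, 3)
combined = relay.nn.conv2d(x, w, padding=(1, 1))   # (1, 96, 32, 32)
out1 = relay.strided_slice(combined, begin=[0, 0, 0, 0],  end=[1, 32, 32, 32])
out2 = relay.strided_slice(combined, begin=[0, 32, 0, 0], end=[1, 96, 32, 32])
```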
Algorithm
1. Find the parallel branches that start from a shared input, each of the form conv2d followed by a chain of ops, where each op is an elemwise or broadcast op.
2. Group branches by the kernel shape and attrs of the conv2d.
3. Combine the conv2d ops in the same group and possibly combine subsequent ops.
4. Use strided_slice to split the output of the combined op.
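A hedged usage sketch: in later TVM releases this pass is exposed from Python as relay.transform.CombineParallelConv2D with a min_num_branches option; the exact entry point at the time of this PR may differ, and the network fragment below is made up.

```python
import tvm
from tvm import relay

# Hypothetical Inception-style fragment: three 1x1 convs over one input.
x = relay.var("x", shape=(1, 64, 56, 56))
weights = [relay.var("w%d" % i, shape=(32, 64, 1, 1)) for i in range(3)]
branches = [relay.nn.conv2d(x, w) for w in weights]
body = relay.Tuple(branches)
func = relay.Function(relay.analysis.free_vars(body), body)

mod = tvm.IRModule.from_expr(func)
mod = relay.transform.InferType()(mod)
mod = relay.transform.CombineParallelConv2D(min_num_branches=2)(mod)
print(mod)  # one conv2d with 96 output channels, followed by a strided_slice per branch
```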
Please review @tqchen @jroesch @ZihengJiang @masahi