add ConvolutionDepthwise layer #5665

Open · wants to merge 9 commits into master
Conversation


@sp2823 sp2823 commented Jun 2, 2017

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
https://arxiv.org/pdf/1704.04861v1.pdf

This adds a convolution depthwise layer that is faster and uses less memory than the "convolution layer with group" (both with and without CuDNN).
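For readers new to the idea: a depthwise convolution applies one filter per input channel, with no mixing across channels. A minimal CPU sketch, assuming stride 1, no padding, and no bias (layout and names are illustrative, not this PR's code):

```cpp
#include <cstddef>
#include <vector>

// Naive depthwise convolution forward pass: each input channel is
// convolved with its own kernel x kernel filter, independently of the
// other channels. Input is CHW, weights are C x kernel x kernel.
std::vector<float> depthwise_conv2d(const std::vector<float>& input,
                                    const std::vector<float>& weight,
                                    int channels, int height, int width,
                                    int kernel) {
  const int out_h = height - kernel + 1;  // stride 1, no padding
  const int out_w = width - kernel + 1;
  std::vector<float> output(channels * out_h * out_w, 0.f);
  for (int c = 0; c < channels; ++c) {  // channels are independent
    for (int oh = 0; oh < out_h; ++oh) {
      for (int ow = 0; ow < out_w; ++ow) {
        float sum = 0.f;
        for (int kh = 0; kh < kernel; ++kh) {
          for (int kw = 0; kw < kernel; ++kw) {
            sum += input[(c * height + oh + kh) * width + ow + kw]
                 * weight[(c * kernel + kh) * kernel + kw];
          }
        }
        output[(c * out_h + oh) * out_w + ow] = sum;
      }
    }
  }
  return output;
}
```

Because each output channel depends on only one input channel, the inner work parallelizes trivially over (channel, y, x), which is what makes a dedicated GPU kernel attractive compared to looping over groups.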

@sp2823 sp2823 closed this Jun 2, 2017
@sp2823 sp2823 reopened this Jun 2, 2017
@willyd willyd mentioned this pull request Jun 7, 2017
weight_multiplier_shape.push_back(top[0]->height());
weight_multiplier_shape.push_back(top[0]->width());
weight_multiplier_.Reshape(weight_multiplier_shape);
caffe_set(weight_multiplier_.count(), Dtype(1),


caffe_set is just for cpu_data @sp2823

sp2823 (Author)


We only need to set mutable_cpu_data or mutable_gpu_data once.
There is a similar implementation of batch_sum_multiplier_ in BatchNormLayer.
If it is necessary, we should use caffe_set in Forward_cpu and caffe_gpu_set in Forward_gpu.


I mean that caffe_set only writes through a cpu_data pointer; passing it a gpu_data pointer would crash.


zj19921221 commented Jun 19, 2017

Two questions, please:
1. What advantages does your implementation have over Caffe's group-based implementation?
2. I only vaguely understand that grouped convolution cannot run in parallel; why does adding a for loop prevent parallelism?


NHZlX commented Jun 19, 2017

The CPU path still needs optimization.


zjchuyp commented Jun 20, 2017

@sp2823
Is using a for loop faster than using gemm in CPU mode?

@mathmanu

Great to see this work - I hope it gets merged soon. The correct name for this should be "DepthwiseSeparable". Just "Depthwise" gives almost the opposite meaning.


sp2823 commented Jun 27, 2017

I didn't optimize the CPU mode because the Convolution layer with group is slow in GPU mode. You can use this code for training and use Convolution layer for prediction.

@youngwanLEE

Could you share a .prototxt that shows how to set the parameters, or some test examples?


mathmanu commented Jul 5, 2017

I have attached the files required to train the popular mobilenet model:

imagenet_mobilenet1.0_2017-07-04_10-44-00.zip

I added the following code in layer_factory.cpp, GetConvolutionLayer(), so that this layer is used whenever it is appropriate:

if (conv_param.num_output() == conv_param.group()) {
  return shared_ptr<Layer<Dtype> >(new ConvolutionDepthwiseLayer<Dtype>(param));
}

There is a speedup when the proposed ConvolutionDepthwise layer is used instead of the Convolution layer, but it is not as much as I expected.

In fact, if I just comment out the group parameter in all convolution layers in both train.prototxt and test.prototxt, so that each 3x3 convolution becomes a traditional 3x3 convolution instead of a depthwise separable one, it becomes slightly faster! This was not what I was expecting.

Is there something that I am missing? Please try the files that I shared.


sp2823 commented Jul 9, 2017

You only need to edit the .prototxt file, changing
type: "Convolution"
to
type: "ConvolutionDepthwise"

@ryusaeba

@sp2823 How do I merge your implementation into my Caffe? Is just downloading the hpp/cpp files enough? Thanks :)


sp2823 commented Jul 10, 2017

Download the .hpp/.cpp/.cu files and compile.


leochli commented Jul 18, 2017

Hi @sp2823,
I am new to Caffe. When trying to compile, I get an error in conv_dw_layer.cpp: "class caffe::ConvolutionParameter' has no member named 'kernel_size_size'".
Any idea what causes this error?

@SophieZhou

Hi, @sp2823
I have trained MobileNet with your code for 20 epochs; top-1 accuracy is about 52% and top-5 about 76%. Do you have any experimental results?
By the way, the code works very well. Training is much faster than with group convolution. I hope the accuracy turns out well too.

@zj19921221

@SophieZhou Hi, did training become much faster on the CPU? What level of performance did it reach?


birdwcp commented Jul 28, 2017

up


birdwcp commented Aug 1, 2017

You did not implement CuDNNConvolutionDepthWiseLayer. Isn't it necessary?


7oud commented Aug 18, 2017

The new implementation runs faster than the "convolution layer with group" (with and without CuDNN), but it still seems not fast enough; e.g. AlexNet has more FLOPS than MobileNet but runs faster.
cuDNN v7 has been released and brings a grouped convolution feature; maybe it will be faster. Did you try it?



twmht commented Sep 27, 2017

@sp2823

Did you compare the FPS with the current Caffe implementation?


libra7 commented Oct 27, 2017

@sp2823 How can the CPU mode be optimized?


dawuchen commented Nov 3, 2017

@mathmanu The reason the traditional 3x3 convolution is faster than the depthwise separable conv is that Caffe uses a for loop to execute grouped convolution, which is very slow. The depthwise separable conv nonetheless has far fewer parameters.
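The parameter-count claim can be checked with the arithmetic from the MobileNet paper: a standard convolution mixes all input channels in every filter, while the depthwise separable factorization uses one small filter per channel plus a 1x1 pointwise convolution. A sketch (helper names are mine, counts are multiplies per output position):

```cpp
#include <cstdint>

// Multiplies per output position for a standard k x k convolution:
// every output channel reads every input channel.
std::int64_t standard_conv_mults(std::int64_t c_in, std::int64_t c_out,
                                 std::int64_t k) {
  return c_in * c_out * k * k;
}

// Multiplies per output position for a depthwise separable convolution:
// one k x k filter per input channel, then a 1x1 pointwise conv to mix
// channels.
std::int64_t depthwise_separable_mults(std::int64_t c_in, std::int64_t c_out,
                                       std::int64_t k) {
  return c_in * k * k + c_in * c_out;
}
```

For c_in = c_out = 256 and k = 3, the standard conv costs 589,824 multiplies per output position versus 67,840 for the separable form, roughly a 8.7x reduction; that the layer is not proportionally faster suggests it is memory-bound rather than compute-bound.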

jsherrah added a commit to jsherrah/caffe that referenced this pull request Nov 20, 2017
* convolution-depthwise:
  unknown error
  abc
  satisfy the code format of caffe
  satisfy the code format of caffe
  satisfy the code format of caffe
  satisfy the code format of caffe
  satisfy the code format of caffe
  add ConvolutionDepthwise layer
  add ConvolutionDepthwise layer
@lijuan123

Hello, I added the ConvolutionDepthwise layer .cpp and .cu files to caffe_ssd on a TX2 and recompiled Caffe. However, when I test the MobileNet model, it fails with the error: cudnn_softmax_layer.cpp:15] Check failed: status == CUDNN_STATUS_SUCCESS (4 vs. 0) CUDNN_STATUS_INTERNAL_ERROR, right after conv3/dw -> conv3/dw. Should I make some changes to the ConvolutionDepthwise layer .cu file?

weight_buffer_shape.push_back(bottom[0]->num());
weight_buffer_shape.push_back(top[0]->height());
weight_buffer_shape.push_back(top[0]->width());
weight_buffer_.Reshape(weight_buffer_shape);
@Noiredd Noiredd Feb 9, 2018


Do we seriously need a 6-dimensional buffer for weights? If I have a batch of 64 feature maps, let's say 256 channels at 32x32, and want to convolve with a 3x3 filter, this line would allocate 256*3*3*64*32*32 = 150,994,944 floats, about 600 MB - that sounds like a significant overkill.
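The allocation arithmetic in that comment can be reproduced directly; shapes below are the comment's hypothetical example, not measured values:

```cpp
#include <cstdint>

// Element count of a [channels, k, k, num, height, width] weight buffer.
std::int64_t buffer_elements(std::int64_t channels, std::int64_t k,
                             std::int64_t num, std::int64_t h,
                             std::int64_t w) {
  return channels * k * k * num * h * w;
}

// Size in bytes assuming 32-bit floats.
std::int64_t buffer_bytes(std::int64_t elements) {
  return elements * 4;
}
```

With channels = 256, k = 3, num = 64, h = w = 32 this gives 150,994,944 elements, i.e. 603,979,776 bytes (about 576 MiB), matching the "about 600 MB" estimate.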


Noiredd commented Feb 9, 2018

I have run some tests, both on a raw convolution-oriented benchmark and on an actually useful network. This is indeed faster than the default Caffe convolution engine for grouped conv, but the RAM requirements are higher, and I'm not so sure that is a good trade.
On a synthetic 1x256x256x256 blob with a 5x5 convolution with group = 256, this PR gives a 33% speedup (312 ms -> 217 ms for a forward-backward pass) at a barely noticeable RAM increase (1924 MB -> 1987 MB). But on a more realistic network with several conv layers, some of them grouped and some not, results varied more drastically: iteration time went down from 119 ms to 46 ms, but RAM usage jumped from 425 MB to 590 MB.

So the question is: can the same speedup be achieved while allocating less memory? Like I said in the comment, do we really need to have such a large buffer for weights?


chibai commented Jun 8, 2018

No offense, I'm new to GitHub. I just want to know whether ConvolutionDepthwise has been added to the master branch or not.
If not, what is the major problem?

@Phil-Lin

Excuse me, I downloaded the core files and compiled successfully, but it crashes when training the net.
Did I miss some step?
*** Aborted at 1563169270 (unix time) try "date -d @1563169270" if you are using GNU date ***
PC: @ 0x7fa004053d3e caffe::ConvolutionDepthwiseLayer<>::LayerSetUp()
*** SIGSEGV (@0x0) received by PID 25146 (TID 0x7fa00473bb00) from PID 0; stack trace: ***
@ 0x7fa0020d74b0 (unknown)
@ 0x7fa004053d3e caffe::ConvolutionDepthwiseLayer<>::LayerSetUp()
@ 0x7fa004150157 caffe::Net<>::Init()
@ 0x7fa00415289e caffe::Net<>::Net()
@ 0x7fa00410eb3a caffe::Solver<>::InitTrainNet()
@ 0x7fa004110005 caffe::Solver<>::Init()
@ 0x7fa00411031f caffe::Solver<>::Solver()
@ 0x7fa00412e4b1 caffe::Creator_SGDSolver<>()
@ 0x40a788 train()
@ 0x407578 main
@ 0x7fa0020c2830 __libc_start_main
@ 0x407e49 _start
@ 0x0 (unknown)
