Easy benchmarking of all public open-source implementations of convnets. A summary is provided in the section below.
Machine: 6-core Intel Core i7-5930K CPU @ 3.50GHz
+ NVIDIA Titan X
+ Ubuntu 14.04 x86_64
##Imagenet Winners Benchmarking
I pick some popular ImageNet models and clock the time for a full forward + backward pass. I average the times over 10 runs and ignore dropout and softmax layers.
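The measurement procedure described above can be sketched roughly as follows. This is only an illustration, not the actual benchmark script: `time_forward_backward`, its `forward`/`backward` callables and the `sync` hook are hypothetical names, and the untimed warm-up pass is an assumption that the text above does not state.

```python
import time


def time_forward_backward(forward, backward, runs=10, sync=None):
    """Return (total_ms, forward_ms, backward_ms) averaged over `runs` passes.

    `forward` and `backward` are zero-argument callables supplied by the caller
    (hypothetical stand-ins for a library's forward and backward calls).
    `sync`, if given, is called around each clock read; on a GPU it should block
    until queued kernels finish, otherwise only launch time is measured.
    """
    def timed(fn):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        return (time.perf_counter() - start) * 1000.0

    # Assumption: one untimed warm-up pass so one-off setup costs are excluded.
    forward()
    backward()

    fwd_total = 0.0
    bwd_total = 0.0
    for _ in range(runs):
        fwd_total += timed(forward)
        bwd_total += timed(backward)

    fwd_ms = fwd_total / runs
    bwd_ms = bwd_total / runs
    return fwd_ms + bwd_ms, fwd_ms, bwd_ms
```

The three returned values correspond to the `Time (ms)`, `forward (ms)` and `backward (ms)` columns in the tables below.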
AlexNet (One Weird Trick paper) - Input 128x3x224x224
Library | Class | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|---|
NervanaSys-16 | ConvLayer | 97 | 30 | 67 |
NervanaSys-32 | ConvLayer | 109 | 31 | 78 |
fbfft | SpatialConvolutionCuFFT | 136 | 45 | 91 |
cudaconvnet2* | ConvLayer | 177 | 42 | 135 |
CuDNN (R2) * | cudnn.SpatialConvolution | 231 | 70 | 161 |
Caffe (native) | ConvolutionLayer | 324 | 121 | 203 |
Torch-7 (native) | SpatialConvolutionMM | 342 | 132 | 210 |
Overfeat [fast] - Input 128x3x231x231
Library | Class | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|---|
NervanaSys-16 | ConvLayer | 364 | 119 | 245 |
NervanaSys-32 | ConvLayer | 410 | 126 | 284 |
fbfft | SpatialConvolutionCuFFT | 407 | 139 | 268 |
cudaconvnet2* | ConvLayer | 723 | 176 | 547 |
CuDNN (R2) * | cudnn.SpatialConvolution | 810 | 234 | 576 |
Caffe | ConvolutionLayer | 823 | 355 | 468 |
Torch-7 (native) | SpatialConvolutionMM | 878 | 379 | 499 |
OxfordNet [Model-A] - Input 64x3x224x224
Library | Class | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|---|
NervanaSys-16 | ConvLayer | 530 | 166 | 364 |
NervanaSys-32 | ConvLayer | 629 | 173 | 456 |
fbfft | SpatialConvolutionCuFFT | 1092 | 355 | 737 |
cudaconvnet2* | ConvLayer | 1229 | 408 | 821 |
CuDNN (R2) * | cudnn.SpatialConvolution | 1099 | 342 | 757 |
Caffe | ConvolutionLayer | 1068 | 323 | 745 |
Torch-7 (native) | SpatialConvolutionMM | 1105 | 350 | 755 |
###Spatial Convolution layer (3D input 3D output, densely connected)
Original Library | Class/Function Benchmarked | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|---|
fbfft | SpatialConvolutionCuFFT | 256 | 101 | 155 |
cuda-convnet2 * | ConvLayer | 977 | 201 | 776 |
cuda-convnet** | pylearn2.cuda_convnet | 1077 | 312 | 765 |
CuDNN R2 * | cudnn.SpatialConvolution | 1019 | 269 | 750 |
Theano | CorrMM | 1225 | 407 | 818 |
Caffe | ConvolutionLayer | 1231 | 396 | 835 |
Torch-7 | SpatialConvolutionMM | 1295 | 418 | 877 |
DeepCL | ConvolutionLayer | 6280 | 2648 | 3632 |
cherry-picking**** | best per layer | 235 | 79 | 155 |
This table is NOT UPDATED for TITAN-X. The numbers below were measured on a Titan Black and are kept only for informational and legacy purposes.
Original Library | Class/Function Benchmarked | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|---|
Theano (experimental)*** | conv2d_fft | 1178 | 304 | 874 |
Torch-7 | nn.SpatialConvolutionBHWD | 1892 | 581 | 1311 |
ccv | ccv_convnet_layer | 809+bw | 809 | |
Theano (legacy) | conv2d | 70774 | 3833 | 66941 |
- * indicates that the library was tested with Torch bindings of the specific kernels.
- ** indicates that the library was tested with Pylearn2 bindings.
- *** This is an experimental module that uses FFT to calculate convolutions. It uses a lot of memory, according to @benanne.
- **** The last row shows results obtainable when choosing the best-performing library for each layer.
- L1 - Input: 128x128, Batch-size: 128, Feature maps: 3->96, Kernel Size: 11x11, Stride: 1x1
- L2 - Input: 64x64, Batch-size: 128, Feature maps: 64->128, Kernel Size: 9x9, Stride: 1x1
- L3 - Input: 32x32, Batch-size: 128, Feature maps: 128->128, Kernel Size: 9x9, Stride: 1x1
- L4 - Input: 16x16, Batch-size: 128, Feature maps: 128->128, Kernel Size: 7x7, Stride: 1x1
- L5 - Input: 13x13, Batch-size: 128, Feature maps: 384->384, Kernel Size: 3x3, Stride: 1x1
- The table is ranked by the total time of the forward + backward calls, summed over all layers (L1 + L2 + L3 + L4 + L5).
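To get a feel for how much work each of the five layers represents, the configurations listed above can be written down as data and used to estimate output sizes and forward multiply-accumulate counts. This is a rough sketch: `LayerConfig` is a hypothetical helper, and the zero-padding ("valid" convolution) assumption is mine, since the list above does not state the padding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerConfig:
    name: str
    size: int      # input height == width
    batch: int
    c_in: int      # input feature maps
    c_out: int     # output feature maps
    kernel: int    # kernel height == width
    stride: int = 1

    def out_size(self):
        # Assumption: zero padding ("valid" convolution).
        return (self.size - self.kernel) // self.stride + 1

    def forward_macs(self):
        # One multiply-accumulate per
        # (image, output pixel, output map, input map, kernel tap).
        o = self.out_size()
        return self.batch * o * o * self.c_out * self.c_in * self.kernel ** 2


LAYERS = [
    LayerConfig("L1", 128, 128, 3, 96, 11),
    LayerConfig("L2", 64, 128, 64, 128, 9),
    LayerConfig("L3", 32, 128, 128, 128, 9),
    LayerConfig("L4", 16, 128, 128, 128, 7),
    LayerConfig("L5", 13, 128, 384, 384, 3),
]

for layer in LAYERS:
    o = layer.out_size()
    print(f"{layer.name}: output {o}x{o}, "
          f"{layer.forward_macs() / 1e9:.1f} GMACs forward")
```

These counts describe a direct convolution only; FFT-based implementations such as fbfft do a different amount of arithmetic, so the numbers are a scale reference rather than a prediction of the timings.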
#####Breakdown
Forward pass: columns L1, L2, L3, L4, L5 and Total are times in milliseconds.
Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
---|---|---|---|---|---|---|---|
fbfft | SpatialConvolutionCuFFT | 57 | 27 | 6 | 2 | 9 | 101 |
cuda-convnet2 * | ConvLayer | 36 | 113 | 40 | 4 | 8 | 201 |
cuda-convnet** | pylearn2.cuda_convnet | 38 | 183 | 68 | 7 | 16 | 312 |
CuDNN R2 | cudnn.SpatialConvolution | 56 | 143 | 53 | 6 | 11 | 269 |
Theano | CorrMM | 91 | 143 | 121 | 24 | 28 | 407 |
Caffe | ConvolutionLayer<Dtype> | 93 | 136 | 116 | 24 | 27 | 396 |
Torch-7 | nn.SpatialConvolutionMM | 94 | 149 | 123 | 24 | 28 | 418 |
DeepCL | ConvolutionLayer | 738 | 1241 | 518 | 47 | 104 | 2648 |
cherry-picking**** | best per layer | 36 | 27 | 6 | 2 | 8 | 79 |
Backward pass: columns L1, L2, L3, L4, L5 and Total are times in milliseconds.
Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
---|---|---|---|---|---|---|---|
fbfft | SpatialConvolutionCuFFT | 76 | 45 | 12 | 4 | 18 | 155 |
cuda-convnet2 * | ConvLayer | 103 | 467 | 162 | 15 | 29 | 776 |
cuda-convnet** | pylearn2.cuda_convnet | 136 | 433 | 147 | 15 | 34 | 765 |
CuDNN R2 | cudnn.SpatialConvolution | 139 | 401 | 159 | 19 | 32 | 750 |
Theano | CorrMM | 179 | 405 | 174 | 29 | 31 | 818 |
Caffe | ConvolutionLayer<Dtype> | 200 | 405 | 172 | 28 | 30 | 835 |
Torch-7 | nn.SpatialConvolutionMM | 206 | 432 | 178 | 29 | 32 | 877 |
DeepCL | ConvolutionLayer | 484 | 2144 | 747 | 59 | 198 | 3632 |
cherry-picking**** | best per layer | 76 | 45 | 12 | 4 | 18 | 155 |
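The cherry-picking row can be reproduced by taking, for each layer, the minimum time across libraries. The sketch below does this for the first breakdown table above (the one whose totals match the forward column of the summary table). The numbers are copied from that table, and `cherry_pick` is a hypothetical helper name.

```python
# Per-layer times in ms (L1..L5), copied from the first breakdown table.
FORWARD_MS = {
    "fbfft": [57, 27, 6, 2, 9],
    "cuda-convnet2": [36, 113, 40, 4, 8],
    "cuda-convnet": [38, 183, 68, 7, 16],
    "CuDNN R2": [56, 143, 53, 6, 11],
    "Theano": [91, 143, 121, 24, 28],
    "Caffe": [93, 136, 116, 24, 27],
    "Torch-7": [94, 149, 123, 24, 28],
    "DeepCL": [738, 1241, 518, 47, 104],
}


def cherry_pick(times_by_library):
    """For each layer return (best_time, library); ties go to the first library listed."""
    n_layers = len(next(iter(times_by_library.values())))
    picks = []
    for i in range(n_layers):
        lib = min(times_by_library, key=lambda name: times_by_library[name][i])
        picks.append((times_by_library[lib][i], lib))
    return picks


picks = cherry_pick(FORWARD_MS)
best_times = [t for t, _ in picks]
print(best_times, sum(best_times))
for layer_index, (t, lib) in enumerate(picks, start=1):
    print(f"L{layer_index}: {lib} ({t} ms)")
```

The same function applied to the second breakdown table gives the cherry-picking row there; in that table fbfft is the fastest on every layer, so the row equals the fbfft row.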