- Title: Wide Residual Networks
- Authors: Sergey Zagoruyko, Nikos Komodakis
- Link: http://arxiv.org/abs/1605.07146v1
- Tags: Neural Network, residual
- Year: 2016
-
What
- The authors start with a standard ResNet architecture (i.e. the pre-activation residual network suggested in "Identity Mappings in Deep Residual Networks"). A sketch of such a block with an added width multiplier is shown after the list of questions below.
- They empirically try to answer the following questions:
- How many residual blocks are optimal? (Depth)
- How many filters should be used per convolutional layer? (Width)
- How many convolutional layers should be used per residual block?
- Does Dropout between the convolutional layers help?
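For reference, a minimal PyTorch sketch (not the authors' code) of such a pre-activation residual block with a width multiplier `k`; the class name `WideBasicBlock` and its arguments are my own:

```python
import torch
import torch.nn as nn


class WideBasicBlock(nn.Module):
    """Pre-activation residual block (BN-ReLU-conv) with two 3x3 convolutions,
    widened by a multiplier k."""

    def __init__(self, in_channels, base_channels, k=1, stride=1):
        super().__init__()
        out_channels = base_channels * k  # the width multiplier widens every block
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, stride=1, padding=1, bias=False)
        # 1x1 convolution on the shortcut whenever the shape changes
        self.shortcut = (
            nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False)
            if stride != 1 or in_channels != out_channels
            else nn.Identity()
        )

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + self.shortcut(x)
```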
-
Results
- Layers per block and kernel sizes:
- Using two 3x3 convolutions per residual block (block type B(3,3)) worked best; blocks with 1x1 convolutions or with more than two convolutions per block did not improve the results.
- Width and depth:
- Increasing the width considerably reduces the test error.
- They achieve their best results (on CIFAR-10) when decreasing the depth to 28 convolutional layers while making each layer 10 times wider (i.e. 16*10, 32*10 and 64*10 filters); see the configuration sketch below.
- They argue that their results show no evidence for the common theory that thin and deep networks somehow regularize better than wide and shallow(er) ones.
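A minimal sketch of how a WRN-28-10 configuration could be assembled from the hypothetical `WideBasicBlock` above; the helper name `make_wrn_stages` is mine, while the base widths 16/32/64 and the relation depth = 6n + 4 follow the paper:

```python
import torch.nn as nn


def make_wrn_stages(depth=28, k=10):
    assert (depth - 4) % 6 == 0, "depth must be of the form 6n + 4"
    n = (depth - 4) // 6      # residual blocks per stage (n = 4 for depth 28)
    widths = [16, 32, 64]     # base widths, multiplied by k -> 160, 320, 640
    stages = []
    in_channels = 16          # the initial 3x3 convolution outputs 16 channels
    for i, w in enumerate(widths):
        blocks = []
        for j in range(n):
            stride = 2 if (i > 0 and j == 0) else 1  # downsample when entering stages 2 and 3
            blocks.append(WideBasicBlock(in_channels, w, k=k, stride=stride))
            in_channels = w * k
        stages.append(nn.Sequential(*blocks))
    return nn.Sequential(*stages)
```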
- Dropout:
- They use dropout with p=0.3 (CIFAR) and p=0.4 (SVHN).
- On CIFAR-10, dropout doesn't seem to improve the test error consistently.
- On CIFAR-100 and SVHN, dropout leads to improvements that are either small (wide and shallower net, i.e. depth=28, width multiplier=10) or significant (ResNet-50).
- They also observed oscillations in error (both train and test) during training. Adding dropout decreased these oscillations. (A sketch of where the dropout sits in a block follows below.)
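A sketch of where the dropout layer sits, i.e. between the two convolutions of a block; this builds on the hypothetical `WideBasicBlock` above, and the subclass name is my own:

```python
import torch
import torch.nn as nn


class WideBasicBlockWithDropout(WideBasicBlock):
    """Same block as above, but with dropout between the two convolutions."""

    def __init__(self, in_channels, base_channels, k=1, stride=1, p=0.3):
        super().__init__(in_channels, base_channels, k=k, stride=stride)
        self.dropout = nn.Dropout(p)  # p = 0.3 for CIFAR, 0.4 for SVHN in the paper

    def forward(self, x):
        out = self.conv1(torch.relu(self.bn1(x)))
        out = self.dropout(out)  # dropout only inside the residual branch
        out = self.conv2(torch.relu(self.bn2(out)))
        return out + self.shortcut(x)
```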
- Computational efficiency:
- Applying a few big convolutions is much more efficient on GPUs than applying many small ones sequentially (see the rough benchmarking sketch below).
- Their network with the best test error is 1.6 times faster than ResNet-1001, despite having about 3 times more parameters.
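A rough, hardware-dependent benchmarking sketch (my own, not from the paper) to illustrate the point: a few wide convolutions tend to run faster on a GPU than many thin ones applied sequentially, even when they have more parameters:

```python
import time

import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# A short stack of wide convolutions vs. a long stack of thin ones.
wide = nn.Sequential(*[nn.Conv2d(160, 160, 3, padding=1) for _ in range(4)]).to(device)
thin = nn.Sequential(*[nn.Conv2d(16, 16, 3, padding=1) for _ in range(100)]).to(device)
x_wide = torch.randn(64, 160, 32, 32, device=device)
x_thin = torch.randn(64, 16, 32, 32, device=device)


def bench(model, inp, iters=10):
    with torch.no_grad():
        model(inp)  # warm-up
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(inp)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.time() - start) / iters


print("few wide convolutions: ", bench(wide, x_wide))
print("many thin convolutions:", bench(thin, x_thin))
```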