This is a quick evaluation of the performance of different activation functions on ImageNet-2012.
The architecture is similar to CaffeNet, with the following differences:
- Images are resized so that the smaller side is 128 pixels, for speed.
- The fc6 and fc7 layers have 2048 neurons instead of 4096.
- Networks are initialized with LSUV-init.
Because LRN layers added nothing to accuracy, they were removed (again for speed) in the later experiments. *The ELU curves are not smooth because the test set size was set incorrectly; the results from 310K to 320K iterations, however, were obtained with the test set size fixed.
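As a rough illustration of what LSUV-init does — orthonormal pre-initialization followed by iteratively rescaling each layer until its output variance is close to 1 — here is a minimal NumPy sketch, assuming plain fully connected ReLU layers rather than the actual CaffeNet convolutions:

```python
import numpy as np

def orthonormal(shape, rng):
    # Orthonormal pre-initialization (Saxe et al.): SVD of a Gaussian matrix.
    a = rng.standard_normal(shape)
    u, _, vt = np.linalg.svd(a, full_matrices=False)
    return u if u.shape == shape else vt

def lsuv_init(weights, x, tol=0.05, max_iter=10):
    """Rescale each layer's weights so its output variance is ~1 on batch x.

    weights : list of (in_dim, out_dim) matrices, pre-initialized orthonormally.
    x       : a data batch used to estimate the output variances.
    """
    h = x
    for w in weights:
        for _ in range(max_iter):
            out = np.maximum(h @ w, 0.0)   # ReLU layer output
            v = out.var()
            if abs(v - 1.0) < tol:
                break
            w /= np.sqrt(v)                # rescale towards unit output variance
        h = np.maximum(h @ w, 0.0)         # propagate to the next layer
    return weights
```

The layer shapes below are illustrative, not the real CaffeNet ones; the point is only that after the procedure every layer produces roughly unit-variance outputs.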
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
ReLU | 0.470 | 2.36 | With LRN layers |
ReLU | 0.471 | 2.36 | No LRN, as in rest |
TanH | 0.401 | 2.78 | |
1.73TanH(2x/3) | 0.423 | 2.66 | As recommended in Efficient BackProp (LeCun et al., 1998) |
ArcSinH | 0.417 | 2.71 | |
VLReLU | 0.469 | 2.40 | y=max(x,x/3) |
RReLU | 0.478 | 2.32 | |
Maxout | 0.482 | 2.30 | sqrt(2)-times narrower layers, 2 pieces. Same complexity as ReLU |
Maxout | 0.517 | 2.12 | same width layers, 2 pieces |
PReLU | 0.485 | 2.29 | |
ELU | 0.488 | 2.28 | alpha=1, as in paper |
ELU | 0.485 | 2.29 | alpha=0.5 |
(ELU+LReLU) / 2 | 0.486 | 2.28 | alpha=1, slope=0.05 |
Shifted Softplus | 0.486 | 2.29 | Shifted BNLL aka softplus, y = log(1 + exp(x)) - log(2). Same as ELU, as expected |
SELU = Scaled ELU | 0.470 | 2.38 | 1.05070 * ELU(x,alpha = 1.6732) |
FReLU = ReLU + (learned) bias | 0.488 | 2.27 | |
[FELU = ELU + (learned) bias] | 0.489 | 2.28 | |
No | 0.389 | 2.93 | No non-linearity, with max-pooling |
No, no max pooling | 0.035 | 6.28 | No non-linearity, strided convolution |
APL2 | 0.471 | 2.38 | 2 linear pieces. Unlike the other activations, the current author's implementation leads to different parameters for each (x, y) position of the neuron |
APL5 | 0.465 | 2.39 | 5 linear pieces. Unlike the other activations, the current author's implementation leads to different parameters for each (x, y) position of the neuron |
ConvReLU,FCMaxout2 | 0.490 | 2.26 | ReLU in convolutional layers, Maxout (sqrt(2)-times narrower, 2 pieces) in fully connected layers. Inspired by Kaggle and "Investigation of Maxout Networks for Speech Recognition"* |
ConvELU,FCMaxout2 | 0.499 | 2.22 | ELU in convolutional layers, Maxout (sqrt(2)-times narrower, 2 pieces) in fully connected layers |
> *The above analyses show that the bottom layers seem to waste a large portion of the additional parametrisation (figure 2 (a,e)), thus could be replaced, for example, by smaller ReLU layers. Similarly, maxout units in higher layers seem to use piecewise-linear components in a more active way, suggesting the use of larger pools.*
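For reference, the less standard activations in the table can be written down directly. A NumPy sketch (the 1.73 in the table is LeCun's 1.7159 rounded; the other constants match the comments above):

```python
import numpy as np

def vlrelu(x):
    # Very Leaky ReLU: y = max(x, x/3)
    return np.maximum(x, x / 3.0)

def scaled_tanh(x):
    # 1.7159 * tanh(2x/3), as recommended in Efficient BackProp (LeCun et al., 1998)
    return 1.7159 * np.tanh(2.0 * x / 3.0)

def elu(x, alpha=1.0):
    # ELU: x for x > 0, alpha * (exp(x) - 1) otherwise
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def selu(x):
    # SELU = 1.05070 * ELU(x, alpha=1.6732)
    return 1.05070 * elu(x, alpha=1.6732)

def shifted_softplus(x):
    # Shifted softplus: log(1 + exp(x)) - log(2), so it passes through the origin
    return np.log1p(np.exp(x)) - np.log(2.0)

def maxout2(x, w1, w2):
    # Maxout with 2 linear pieces: elementwise max of two linear projections
    return np.maximum(x @ w1, x @ w2)
```

Shifted softplus behaves like a smooth ELU around zero, which matches the near-identical scores of the two in the table.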
Name | Accuracy | LogLoss | Comments |
---|---|---|---|
ReLU | 0.499 | 2.21 | |
RReLU | 0.500 | 2.20 | |
PReLU | 0.503 | 2.19 | |
ELU | 0.498 | 2.23 | |
Maxout | 0.487 | 2.28 | |
Sigmoid | 0.475 | 2.35 | |
No | 0.384 | 2.96 | |
Previous results on small datasets like CIFAR (see LSUV-init, Table 3) look a bit contradictory to the ImageNet ones so far.
The Maxout net has two linear pieces, and each piece has sqrt(2) times fewer parameters than the corresponding *ReLU network, so the overall complexity is the same.
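This can be checked with quick arithmetic: shrinking both fully connected widths by sqrt(2) and doubling the pieces leaves the fc6-to-fc7 parameter count essentially unchanged (2048 is the fc width used in these experiments):

```python
import math

fc = 2048                              # fc6/fc7 width in these experiments
relu_params = fc * fc                  # weight matrix between two fc ReLU layers

narrow = round(fc / math.sqrt(2))      # sqrt(2)-times narrower: ~1448 neurons
maxout_params = 2 * narrow * narrow    # 2 linear pieces, both layers narrowed

print(relu_params, maxout_params)      # the two counts agree to within ~0.1%
```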
P.S. The training logs are merged from many "save-resume" sessions, because the networks were trained at night, so an "Accuracy vs. seconds" plot would give weird results.