Welcome to the evaluation of CNN design choices on ImageNet-2012. Here you can find the prototxt files of the tested nets and the full training logs.

upd.: Here is the technical report version of this benchmark.

If you use results from this benchmark, please cite

@Article{CaffeNetBench2017,
  Title                    = {Systematic evaluation of convolution neural network advances on the Imagenet},
  Author                   = {Dmytro Mishkin and Nikolay Sergievskiy and Jiri Matas},
  Journal                  = {Computer Vision and Image Understanding},
  Year                     = {2017},
  Doi                      = {https://doi.org/10.1016/j.cviu.2017.05.007},
  ISSN                     = {1077-3142},
  Keywords                 = {CNN},
  Url                      = {http://www.sciencedirect.com/science/article/pii/S1077314217300814}
}

upd2.: Some of the pretrained models are in the Releases section. They are licensed for unrestricted use.

upd3.: A nice paper on noise sensitivity: Fine-grained Recognition in the Noisy Wild: Sensitivity Analysis of Convolutional Neural Networks Approaches

The basic architecture is similar to CaffeNet, but has several differences:

Images are resized to small side = 128 for speed reasons. Therefore pool5 spatial size is 3x3 instead of 6x6.
fc6 and fc7 layers have 2048 neurons instead of 4096.
Networks are initialized with LSUV-init (code)
Because LRN layers add nothing to accuracy (validated here), they were removed for speed reasons in most experiments.

Taking into account Neural Network Training Variations in Speech and Subsequent Performance Evaluation, results can vary from run to run (the data order is the same, but the random seeds are different). However, I haven't observed any difference in results across several CaffeNet-ReLU training runs.

On-going evaluations with graphs:

activations
pooling
solvers
lr_policy
architectures
First layer parameters
Conv1 depth
classifier architectures
augmentation
batchnorm
colorspace
regularization
resnets, not yet successful
batch size
dataset size
Network width
other mix

Activations

Name	Accuracy	LogLoss	Comments
ReLU	0.470	2.36	With LRN layers
ReLU	0.471	2.36	No LRN, as in rest
TanH	0.401	2.78
1.73TanH(2x/3)	0.423	2.66	As recommended in Efficient BackProp, LeCun98
ArcSinH	0.417	2.71
VLReLU	0.469	2.40	y=max(x,x/3)
RReLU	0.478	2.32
Maxout	0.482	2.30	sqrt(2) narrower layers, 2 pieces. Same complexity as ReLU
Maxout	0.517	2.12	same width layers, 2 pieces
PReLU	0.485	2.29
ELU	0.488	2.28	alpha=1, as in paper
ELU	0.485	2.29	alpha=0.5
(ELU+LReLU) / 2	0.486	2.28	alpha=1, slope=0.05
SELU = Scaled ELU	0.470	2.38	1.05070 * ELU(x,alpha = 1.6732)
FReLU = ReLU + (learned) bias	0.488	2.27
FELU = ELU + (learned) bias	0.489	2.28
Shifted Softplus	0.486	2.29	Shifted BNLL aka softplus, y = log(1 + exp(x)) - log(2). Same as ELU, as expected
No, with max pooling	0.389	2.93	No non-linearity
No, no max pooling	0.035	6.28	No non-linearity, strided convolution
APL2	0.471	2.38	2 linear pieces. Unlike other activations, the current author's implementation leads to different parameters for each x,y position of the neuron
APL5	0.465	2.39	5 linear pieces. Unlike other activations, the current author's implementation leads to different parameters for each x,y position of the neuron
ConvReLU,FCMaxout2	0.490	2.26	ReLU in convolution, Maxout (sqrt(2) narrower) 2 pieces in FC. Inspired by kaggle and INVESTIGATION OF MAXOUT NETWORKS FOR SPEECH RECOGNITION*
ConvELU,FCMaxout2	0.499	2.22	ELU in convolution, Maxout (sqrt(2) narrower) 2 pieces in FC.

* The above analyses show that the bottom layers seem to waste a large portion of the additional parametrisation (figure 2 (a,e)) thus could be replaced, for example, by smaller ReLU layers. Similarly, maxout units in higher layers seem to use piecewise-linear components in a more active way suggesting the use of larger pools.
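
Several of the less common activations in the table can be written straight from the formulas in the Comments column. The following is a minimal pure-Python sketch of those formulas, not the Caffe layers used in the benchmark; the function names are illustrative and operate on a single scalar.

```python
import math

def vlrelu(x):
    """Very leaky ReLU from the table: y = max(x, x/3)."""
    return max(x, x / 3.0)

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * math.expm1(x)

def selu(x):
    """Scaled ELU with the constants listed in the table."""
    return 1.05070 * elu(x, alpha=1.6732)

def shifted_softplus(x):
    """Shifted BNLL: log(1 + exp(x)) - log(2), so the output is 0 at x = 0."""
    return math.log1p(math.exp(x)) - math.log(2.0)

def scaled_tanh(x):
    """The 1.73TanH(2x/3) row."""
    return 1.73 * math.tanh(2.0 * x / 3.0)

for x in (-3.0, 0.0, 3.0):
    print(x, vlrelu(x), elu(x), selu(x), shifted_softplus(x), scaled_tanh(x))
```

Shifted softplus and ELU with alpha=1 both pass through zero at the origin and saturate for negative inputs, which is consistent with the "Same as ELU, as expected" comment in the table.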

Pooling type

Name	Accuracy	LogLoss	Comments
MaxPool	0.471	2.36
Stochastic	0.438	2.54	Underfitting, maybe try without dropout
Stochastic, no dropout	0.429	2.96	Stochastic pooling does not prevent overfitting without dropout :(. Good start, bad finish
AvgPool	0.435	2.56
Max+AvgPool	0.483	2.29	Element-wise sum
NoPool	0.472	2.35	Strided conv2,conv3,conv4
General	-	-	Depends on arch, click for details
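
The Max+AvgPool row combines the two pooling results by element-wise sum. A minimal sketch of that idea on a 1-D list follows; it is an illustration only, the function name is hypothetical, and it uses simple non-padded windows rather than Caffe's pooling layer.

```python
def max_plus_avg_pool(row, kernel=3, stride=2):
    """Pool a 1-D list: per-window max plus per-window mean (element-wise sum)."""
    out = []
    for start in range(0, len(row) - kernel + 1, stride):
        window = row[start:start + kernel]
        out.append(max(window) + sum(window) / len(window))
    return out

print(max_plus_avg_pool([1.0, 2.0, 3.0, 4.0, 5.0]))  # [5.0, 9.0]
```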

Pooling window/stride

Name	Accuracy	LogLoss	Comments
MaxPool 3x3/2	0.471	2.36	default alexnet
MaxPool 2x2/2	0.484	2.29	Leads to larger feature map, Pool5=4x4 instead of 3x3
MaxPool 3x3/2 pad1	0.488	2.25	Leads to even larger feature map, Pool5=5x5 instead of 3x3
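
The pool5 sizes quoted in the Comments column can be checked with a few lines of arithmetic. The sketch below assumes the standard CaffeNet layer layout (conv1 11x11 stride 4 without padding, conv2 5x5 with pad 2, conv3 to conv5 3x3 with pad 1), Caffe-style rounding (down for convolution, up for pooling), and that the same window is used for all three pooling layers; these are assumptions, not settings read from the prototxt files.

```python
import math

def conv_out(size, kernel, stride=1, pad=0):
    """Convolution output size, rounded down."""
    return (size + 2 * pad - kernel) // stride + 1

def pool_out(size, kernel, stride, pad=0):
    """Pooling output size, rounded up; a last window that starts in the padding is dropped."""
    out = math.ceil((size + 2 * pad - kernel) / stride) + 1
    if pad > 0 and (out - 1) * stride >= size + pad:
        out -= 1
    return out

def pool5_size(input_size, pool_kernel=3, pool_stride=2, pool_pad=0):
    """Spatial size after pool5 for the assumed CaffeNet layout."""
    s = conv_out(input_size, kernel=11, stride=4)         # conv1
    s = pool_out(s, pool_kernel, pool_stride, pool_pad)   # pool1
    s = conv_out(s, kernel=5, pad=2)                      # conv2
    s = pool_out(s, pool_kernel, pool_stride, pool_pad)   # pool2
    for _ in range(3):                                    # conv3, conv4, conv5
        s = conv_out(s, kernel=3, pad=1)
    return pool_out(s, pool_kernel, pool_stride, pool_pad)  # pool5

print(pool5_size(128))                                # 3  (MaxPool 3x3/2)
print(pool5_size(128, pool_kernel=2))                 # 4  (MaxPool 2x2/2)
print(pool5_size(128, pool_pad=1))                    # 5  (MaxPool 3x3/2 pad1)
print(pool5_size(227))                                # 6  (full-size CaffeNet)
```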

CLF architecture

Name	Accuracy	LogLoss	Comments
Default ReLU	0.470	2.36	fc6 = conv 3x3x2048 -> fc7 2048 -> 1000 fc8
Conv5-fc6=2048C3_2048C1_clf_avg	0.494	2.34	no pool5 -> fc6 = conv 3x3x2048 -> fc7=conv 1x1x2048 -> fc8 as 1x1 conv -> ave_pool.
Pool5-fc6=2048C3_2048C1_avg_clf	0.489	2.28	no pool5 -> fc6 = conv 3x3x2048 -> fc7=conv 1x1x2048 -> ave_pool -> fc8
SPP2-FC-FC	0.471	2.36	pool5 = SPP with 2 levels (2x2 and 1x1) -> FC6 -> FC7
SPP3-FC-FC	0.483	2.30	pool5 = SPP with 3 levels (3x3 and 2x2 and 1x1) -> FC6 -> FC7
fc6=512C3_1024C3_1536C1	0.482	2.52	pool5 zero pad -> fc6 = conv 3x3x512 -> fc7=conv 3x3x1024 -> 1x1x1536 -> fc8 as 1x1 conv -> ave_pool.
fc6=512C3_1024C3_1536C1_drop	0.491	2.29	pool5 zero pad -> fc6 = conv 3x3x512 -> fc7=conv 3x3x1024 -> drop 0.3 -> 1x1x1536 -> drop 0.5-> fc8 as 1x1 conv -> ave_pool.
Default ReLU, 4096	0.497	2.24	fc6 = conv 3x3x4096 -> fc7 4096 -> 1000 fc8 == original caffenet

Name	Accuracy	LogLoss	Comments
Default ELU	0.488	2.28	fc6 = conv 3x3x2048 -> fc7 2048 -> 1000 fc8
pool5pad_fc6ave	0.481	2.32	pool5 zero pad -> fc6 = conv 3x3x2048 -> AvePool -> as usual
pool5pad_fc6ave_fc7as1x1fc8ave	0.511	2.21	pool5 zero pad -> fc6 = conv 3x3x2048 -> fc7 as 1x1 conv -> fc8 as 1x1 conv -> ave_pool.
pool5pad_fc6ave_fc7as1x1avefc8	0.508	2.22	pool5 zero pad -> fc6 = conv 3x3x2048 -> fc7 as 1x1 conv -> ave_pool -> fc8
pool5pad_fc6ave_fc7as1x1_avemax_fc8	0.509	2.19	pool5 zero pad -> fc6 = conv 3x3x2048 -> fc7 as 1x1 conv -> fc8 as 1x1 conv -> ave_pool + max_pool.

Conv1 parameters

Name	Accuracy	LogLoss	Comments
Default, 128_K11_S4	0.471	2.36	Input size =128x128px, conv1 = 11x11x96, stride = 4
224_K11_S8	0.453	2.45	Input size =256x256px, conv1 = 11x11x96, stride = 8. Not finished yet
160_K11_S5	0.470	2.35	Input size =160x160px, conv1 = 11x11x96, stride = 5
96_K7_S3	0.459	2.43	Input size =96x96px, conv1 = 7x7x96, stride = 3
64_K5_S2	0.445	2.50	Input size =64x64px, conv1 = 5x5x96, stride = 2
32_K3_S1	0.390	2.84	Input size =32x32px, conv1 = 3x3x96, stride = 1
4x slower, 227_K11_S4	0.565	1.87	Input size = 227x227px, conv1 = 11x11x96, stride = 4, Not finished yet
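
Most of the reduced-input variants above pair the input size with a conv1 kernel and stride that give the same conv1 output resolution. A short check, assuming an unpadded conv1 with the output size rounded down (the `configs` dictionary simply restates the rows of the table):

```python
def conv_out(size, kernel, stride, pad=0):
    """Convolution output size, rounded down."""
    return (size + 2 * pad - kernel) // stride + 1

# (input size, conv1 kernel, conv1 stride), taken from the Comments column
configs = {
    "128_K11_S4": (128, 11, 4),
    "160_K11_S5": (160, 11, 5),
    "96_K7_S3": (96, 7, 3),
    "64_K5_S2": (64, 5, 2),
    "32_K3_S1": (32, 3, 1),
    "227_K11_S4": (227, 11, 4),
}

conv1_sizes = {name: conv_out(*cfg) for name, cfg in configs.items()}
for name, size in conv1_sizes.items():
    print(name, size)
```

Under these assumptions every variant from 32px to 160px yields a 30x30 conv1 map, while the 227px network yields 55x55, so the accuracy differences among the small variants come from how much image detail reaches conv1 rather than from a change in feature-map size.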

Solvers

Name	Accuracy	LogLoss	Comments
SGD with momentum	0.471	2.36
Nesterov	0.473	2.34
RMSProp	0.327	3.20	rms_decay=0.9, delta=1.0
RMSProp	0.453	2.45	rms_decay=0.9, delta=1.0, base_lr: 0.045, stepsize=10K. gamma=0.94 (from here)
RMSProp	0.451	2.43	rms_decay=0.9, delta=1.0, base_lr: 0.1, stepsize=10K. gamma=0.94
RMSProp	0.472	2.36	rms_decay=0.9, delta=1.0, base_lr: 0.1, stepsize=5K. gamma=0.94
RMSProp	0.486	2.28	rms_decay=0.9, delta=1.0, lr=0.1, linear lr_policy
SGD with momentum, linear	0.493	2.24	linear lr_policy

LR-policy

Name	Accuracy	LogLoss	Comments
Step 100K	0.471	2.36	Default caffenet solver, max_iter=320K
Poly lr, p=0.5, sqrt	0.483	2.29	bvlc_quick_googlenet_solver. Worse than "step" throughout training, but ahead at the finish
Poly lr, p=2.0, sqr	0.483	2.299
Poly lr, p=1.0, linear	0.493	2.24
Poly lr, p=1.0, linear	0.466	2.39	max_iter=160K
Exp, 0.035	0.441	2.53	max_iter=160K, stepsize=2K, gamma=0.915, same as in base_dereyly
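
For readers unfamiliar with the policy names, the "step" and "poly" schedules can be sketched as below, following the formulas documented for Caffe's solver. The defaults are assumptions: base_lr=0.01 and stepsize=100K match the default rows in this README, while gamma=0.1 is the usual value in the reference CaffeNet solver and is not stated here.

```python
def step_lr(it, base_lr=0.01, gamma=0.1, stepsize=100_000):
    """Step policy: multiply the rate by gamma every `stepsize` iterations."""
    return base_lr * gamma ** (it // stepsize)

def poly_lr(it, base_lr=0.01, max_iter=320_000, power=1.0):
    """Poly policy: base_lr * (1 - it / max_iter) ** power; power=1 is the linear policy."""
    return base_lr * (1.0 - it / max_iter) ** power

for it in (0, 100_000, 160_000, 250_000, 320_000):
    print(it, step_lr(it), poly_lr(it), poly_lr(it, power=0.5), poly_lr(it, power=2.0))
```

The "Exp" rows list a stepsize of 2K with gamma=0.915, which in this sketch would correspond to `step_lr(it, base_lr=0.035, gamma=0.915, stepsize=2_000)`, a fine-grained step schedule that approximates exponential decay.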

LR-policy-BatchNorm-Dropout = 0.2

Name	Accuracy	LogLoss	Comments
Step 100K	0.527	2.09	Default caffenet solver, max_iter=320K
Poly lr, p=1.0, linear	0.496	2.24	max_iter=105K,
Poly lr, p=1.0, start_lr=0.02	0.505	2.21	max_iter=105K
Exp, 0.035	0.506	2.19	max_iter=160K, stepsize=2K, gamma=0.915, same as in base_dereyly

Regularization

Name	Accuracy	LogLoss	Comments
default	0.471	2.36	weight_decay=0.0005, L2, fc-dropout=0.5
wd=0.0001	0.450	2.48	weight_decay=0.0001, L2, fc-dropout=0.5
wd=0.00001	0.450	2.48	weight_decay=0.00001, L2, fc-dropout=0.5
wd=0.00001_L1	0.453	2.45	weight_decay=0.00001, L1, fc-dropout=0.5
drop=0.3	0.497	2.25	weight_decay=0.0005, L2, fc-dropout=0.3
drop=0.2	0.494	2.28	weight_decay=0.0005, L2, fc-dropout=0.2
drop=0.1	0.473	2.45	weight_decay=0.0005, L2, fc-dropout=0.1. Same accuracy as with 0.5, but higher logloss

Dropout and width

Name	Accuracy	LogLoss	Comments
fc6,fc7=2048, dropout=0.5	0.471	2.36	default
fc6,fc7=2048, dropout=0.3	0.497	2.25	best for fc6,fc7=2048. (1-0.3)*2048=1433 neurons work each time
fc6,fc7=4096, dropout=0.65	0.465	2.38	(1-0.65)*4096=1433 neurons work each time
fc6,fc7=6144, dropout=0.77	0.447	2.48	(1-0.77)*6144=1433 neurons work each time
fc6,fc7=4096, dropout=0.5	0.497	2.24
fc6,fc7=1433, dropout=0	0.456	2.52
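
The "neurons work each time" figures in the table come from multiplying the layer width by the keep probability. A small sketch of that arithmetic (the helper name is illustrative):

```python
def active_units(width, dropout):
    """Expected number of units kept per forward pass: (1 - dropout) * width."""
    return (1.0 - dropout) * width

for width, p in [(2048, 0.3), (4096, 0.65), (6144, 0.77)]:
    print(width, p, active_units(width, p))
```

The first two settings give about 1433.6 units. With the dropout rate rounded to 0.77, the 6144-wide setting gives about 1413 units; keeping exactly 1433 of 6144 units would correspond to a dropout rate of roughly 0.767.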

Architectures

Name	Accuracy	LogLoss	Comments
CaffeNet256	0.565	1.87	Reference BVLC model, LSUV init
CaffeNet128	0.470	2.36	Pool5 = 3x3
CaffeNet128_4096	0.497	2.24	Pool5 = 3x3, fc6-fc7=4096
CaffeNet128All	0.530	2.05	All improvements without caffenet arch change: ELU + SPP + color_trans3-10-3 + Nesterov+ (AVE+MAX) Pool + linear lr_policy
	+ 0.06		Gain over vanilla caffenet128. "Sum of gains" = 0.018 + 0.013 + 0.015 + 0.003 + 0.013 + 0.023 = 0.085
SqueezeNet128	0.530	2.08	Reference solver, but linear lr_policy and batch_size=256 (320K iters). WITHOUT tricks like ELU, SPP, AVE+MAX, etc.
SqueezeNet128	0.547	2.08	New SqueezeNet solver. WITHOUT tricks like ELU, SPP, AVE+MAX, etc.
SqueezeNet224	0.592	1.80	New SqueezeNet solver. WITHOUT tricks like ELU, SPP, AVE+MAX, etc., 2 GPU
CaffeNet256All	0.613	1.64	All improvements without caffenet arch change: ELU + SPP + color_trans3-10-3 + Nesterov+ (AVE+MAX) Pool + linear lr_policy
CaffeNet128, no pad	0.411	2.70	No padding, but conv1 stride=2 instead of 4 to keep size of pool5 the same
CaffeNet128, dropout in conv	0.426	2.60	Dropout before pool2=0.1, after conv3 = 0.1, after conv4 = 0.2
CaffeNet128SPP	0.483	2.30	SPP= 3x3 + 2x2 + 1x1
DarkNet128BN	0.502	2.25	16C3->MP2->32C3->MP2->64C3->MP2->128C3->MP2->256C3->MP2->512C3->MP2->1024C3->1000CLF.BN + PReLU + base_lr=0.035, exp lr_policy, 160K iters
NiN128	0.519	2.15	Step lr_policy. Be careful not to use in-place dropout on maxpool
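
The "Sum of gains" note in the CaffeNet128All row can be reproduced from the individual tables. The sketch below assumes the six gains are listed in the same order as the improvements named in that row and uses 0.470 as the baseline accuracy; the dictionary keys are only labels.

```python
baseline = 0.470

# Individual gains over the baseline, in the order given in the CaffeNet128All row
gains = {
    "ELU": 0.018,
    "SPP": 0.013,
    "color_trans3-10-3": 0.015,
    "Nesterov": 0.003,
    "AVE+MAX pool": 0.013,
    "linear lr_policy": 0.023,
}

sum_of_gains = round(sum(gains.values()), 3)
actual_gain = round(0.530 - baseline, 3)

print(sum_of_gains)  # 0.085
print(actual_gain)   # 0.06
```

The combined network gains less than the sum of the individual gains, so the improvements are not fully additive.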

Name	Accuracy	LogLoss	Comments
DarkNetBN	0.502	2.25	16C3->MP2->32C3->MP2->64C3->MP2->128C3->MP2->256C3->MP2->512C3->MP2->1024C3->1000CLF.BN
HeNet2x2	0.561	1.88	No SPP, Pool5 = 3x3, VLReLU, J' from paper
HeNet3x1	0.560	1.88	No SPP, Pool5 = 3x3, VLReLU, J' from paper, 2x2->3x1
GoogLeNet128	0.619	1.61	linear lr_policy, batch_size=256. obviously slower than caffenet
[GoogLeNet128_BN_lim0606](https://github.com/lim0606/caffe-googlenet-bn)	0.645	1.54	BN before ReLU + scale bias, linear LR, batch_size = 128, base_lr = 0.005, 640K iter, LSUV init. 5x5 replaced by two 3x3, no in-place
GoogLeNet128Res	0.634	1.56	linear lr_policy, batch_size=256. Residual connections between inception blocks. No BN
GoogLeNet128Res_color	0.638	1.52	linear lr_policy, batch_size=256. Residual connections between inception blocks. No BN. + color_trans3-10-3
googlenet_loss2_clf	0.571	1.80	from net above, aux classifier after inception_4d
googlenet_loss1_clf	0.520	2.06	from net above, aux classifier after inception_4a
fitnet1_elu	0.333	3.21
VGGNet16_128	0.651	1.46	Surprisingly, much better than GoogLeNet128, even with step-based solver.
VGGNet16_128_All	0.682	1.47	ELU (a=0.5; a=1 leads to divergence :( ), avg+max pool, color conversion, linear lr_policy

ResNets, good attempts

Name	Accuracy	LogLoss	Comments
ResNet-50ELU-2xThinner	0.616	1.63	Without BN, ELU, dropout=0.2 before classifier. 2x thinner than in the paper. Quite fast. No large overfitting (unlike upper table)
GoogLeNet-128	0.619	1.61	For reference. linear lr_policy, batch_size=256.
GoogLeNet128Res	0.634	1.56	linear lr_policy, batch_size=256. Residual connections between inception blocks. No BN
VggLikeResNet-50-ELU-RoR-var	0.626	1.59	Step LR policy, max_iter = 200K, no BN, 4x thinner than VGG, Residual on residual.
VggLikeResNet-50-ELU	0.632	1.57	Step LR policy, max_iter = 200K, no BN, 4x thinner than VGG. More RoR.
VggLikeResNet-50-ELU-RoR 1x5	0.628	1.58	Step LR policy, max_iter = 200K, no BN, 4x thinner than VGG. 1x5 layers
VggLikeResNet-50-ELU-RoR 1x3	0.631	1.58	Step LR policy, max_iter = 200K, no BN, 4x thinner than VGG.

Train augmentation

Name	Accuracy	LogLoss	Comments
Default	0.471	2.36	Random flip, random crop 128x128 from 144xN, N > 144
Drop 0.1	0.306	3.56	+ Input dropout 10%. not finished, 186K iters result
Multiscale	0.462	2.40	Random flip, random crop 128x128 from (144xN - 50%, 188xN - 20%, 256xN - 20%, 130xN - 10%)
5 deg rot	0.448	2.47	Random rotation to [0..5] degrees.
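
The default and multiscale augmentations can be sketched as sampling a crop window, a mirror flag, and (for the multiscale row) the small side of the resized image. This is an illustrative sketch with hypothetical function names, not the Caffe data layer used for training; it only draws the random parameters and does not touch pixel data.

```python
import random

def sample_crop(height, width, crop=128, rng=random):
    """Pick the top-left corner of a random crop and a coin flip for mirroring."""
    top = rng.randint(0, height - crop)
    left = rng.randint(0, width - crop)
    mirror = rng.random() < 0.5
    return top, left, mirror

def sample_small_side(rng=random):
    """Multiscale row: draw the small side with the listed probabilities."""
    sides = [144, 188, 256, 130]
    weights = [0.5, 0.2, 0.2, 0.1]
    return rng.choices(sides, weights=weights, k=1)[0]

rng = random.Random(0)
small_side = sample_small_side(rng)
top, left, mirror = sample_crop(small_side, small_side + 48, rng=rng)
print(small_side, top, left, mirror)
```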

Colorspace

Name	Accuracy	LogLoss	Comments
RGB	0.471	2.36	default, no changes. Input = 0.04 * (Img - [104, 117, 124])
RGB_by_BN	0.469	2.38	Input = BatchNorm(Img)
CLAHE	0.467	2.38	RGB -> LAB -> CLAHE(L)->RGB->BatchNorm(RGB)
HISTEQ	0.448	2.48	RGB -> HistEq
YCrCb	0.458	2.42	RGB->YCrCb->BatchNorm(YCrCb)
HSV	0.451	2.46	RGB->HSV->BatchNorm(HSV)
Lab	-	-	Loss does not leave 6.90 (chance level, ln(1000)) after 1.5K iters
RGB->10->3 TanH	0.463	2.40	RGB -> conv1x1x10 tanh -> conv1x1x3 tanh
RGB->10->3 VlReLU	0.485	2.28	RGB -> conv1x1x10 vlrelu -> conv1x1x3 vlrelu
RGB->10->3 Maxout	0.488	2.26	RGB -> conv1x1x10 maxout(2) -> conv1x1x3 maxout(2)
RGB->16->3 VlReLU	0.483	2.30	RGB -> conv1x1x16 vlrelu -> conv1x1x3 vlrelu
RGB->3->3 VlReLU	0.480	2.32	RGB -> conv1x1x3 vlrelu -> conv1x1x3 vlrelu
RGB->10->3 VlReLU->sum(RGB)	0.482	2.30	RGB -> conv1x1x10 vlrelu -> conv1x1x3 -> sum(RGB) ->vlrelu
RGB and log(RGB)->10->3 VlReLU	0.482	2.29	RGB and log (RGB) -> conv1x1x10 vlrelu -> conv1x1x3 vlrelu
RGB and log(RGB) and log (256-RGB)->10->3 VlReLU	0.484	2.29	RGB and log (RGB) and log (256 - RGB) -> conv1x1x10 vlrelu -> conv1x1x3 vlrelu
NN-Scale	0.467	2.38	Nearest neighbor instead of linear interpolation for rescale. Faster, but worse :(
concat_rgb_each_pool	0.441	2.51	Concat avepoolRGB with each pool
OpenCV RGB2Gray	0.413	2.70	RGB->Grayscale Gray = 0.299 R + 0.587 G + 0.114 B
Learned RGB2Gray	0.419	2.66	RGB->conv1x1x1. Gray = -1.779 R + 6.511 G + 1.493 B + 3.279
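
The default input scaling and the two grayscale conversions are simple per-pixel formulas. A minimal sketch follows, taking the constants from the Comments column; the function names are illustrative, the channel order of the mean values is kept as listed, and whether the learned weights were applied to raw or scaled pixel values is not stated in the table.

```python
MEAN = (104.0, 117.0, 124.0)  # per-channel means as listed in the RGB row
SCALE = 0.04

def preprocess(pixel):
    """Default input: 0.04 * (Img - mean), applied per channel."""
    return tuple(SCALE * (value - mean) for value, mean in zip(pixel, MEAN))

def opencv_gray(r, g, b):
    """Fixed OpenCV-style luminance weights."""
    return 0.299 * r + 0.587 * g + 0.114 * b

def learned_gray(r, g, b):
    """Weights and bias of the learned 1x1 convolution reported in the table."""
    return -1.779 * r + 6.511 * g + 1.493 * b + 3.279

print(preprocess((104.0, 117.0, 124.0)))
print(opencv_gray(255.0, 255.0, 255.0))
print(learned_gray(1.0, 1.0, 1.0))
```

The learned conversion weights the green channel far more heavily than the fixed formula and gives the red channel a negative weight.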

Batch normalization

BN -- before or after ReLU?

Name	Accuracy	LogLoss	Comments
Before	0.474	2.35	As in paper
Before + scale&bias layer	0.478	2.33	As in paper
After	0.499	2.21
After + scale&bias layer	0.493	2.24
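
The two placements differ only in the order of the operations applied to a layer's pre-activations. A toy sketch on a single channel follows; it normalizes over a list of values standing in for one mini-batch, uses no learned scale or bias, and the function names are illustrative rather than taken from the prototxt files.

```python
import math

def batch_norm(values, eps=1e-5):
    """Normalize a batch of scalars to zero mean and (nearly) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def relu(values):
    return [max(0.0, v) for v in values]

def bn_before_relu(pre_activations):
    """'Before': conv -> BN -> ReLU."""
    return relu(batch_norm(pre_activations))

def bn_after_relu(pre_activations):
    """'After': conv -> ReLU -> BN."""
    return batch_norm(relu(pre_activations))

x = [-2.0, -1.0, 0.5, 3.0]
print(bn_before_relu(x))
print(bn_after_relu(x))
```

With BN after ReLU, the next layer receives zero-mean inputs; with BN before ReLU, it receives non-negative inputs.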

BN and activations

Name	Accuracy	LogLoss
ReLU	0.499	2.21
RReLU	0.500	2.20
PReLU	0.503	2.19
ELU	0.498	2.23
Maxout	0.487	2.28
Sigmoid	0.475	2.35
TanH	0.448	2.50
No	0.384	2.96

BN and dropout

Name	Accuracy	LogLoss
Dropout = 0.5	0.499	2.21
Dropout = 0.2	0.527	2.09
Dropout = 0	0.513	2.19

BN-arch-init

Name	Accuracy	LogLoss
Caffenet	0.471	2.36
Caffenet BN Before + scale&bias layer LSUV	0.478	2.33
Caffenet BN Before + scale&bias layer Ortho	0.482	2.31
Caffenet BN After LSUV	0.499	2.21
Caffenet BN After Ortho	0.500	2.20

Batch size, ReLU

Name	Accuracy	LogLoss	Comments
BS=1024, 4xlr	0.465	2.38	lr=0.04, 80K iters
BS=1024	0.419	2.65	lr=0.01, 80K iters
BS=512, 2xlr	0.469	2.37	lr=0.02, 160K iters
BS=512	0.455	2.46	lr=0.01, 160K iters
BS=256, default	0.471	2.36	lr=0.01, 320K iters
BS=128	0.472	2.35	lr=0.01, 640K iters
BS=128, 1/2 lr	0.470	2.36	lr=0.005, 640K iters
BS=64	0.471	2.34	lr=0.01, 1280K iters
BS=64, 1/4 lr	0.475	2.34	lr=0.0025, 1280K iters
BS=32	0.463	2.40	lr=0.01, 2560K iter
BS=32, 1/8 lr	0.470	2.37	lr=0.00125, 2560K iter
BS=1, 1/256 lr	0.474	2.35	lr=3.9063e-05, 81920K iter. Online training
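
The learning rates and iteration counts in the scaled rows follow from keeping the number of processed images fixed and scaling the learning rate linearly with the batch size. A short sketch of that arithmetic, using the default row (BS=256, lr=0.01, 320K iters) as the reference; the function name is illustrative.

```python
def scaled_schedule(batch_size, ref_bs=256, ref_lr=0.01, ref_iters=320_000):
    """Return (lr, iterations) for a batch size under linear lr scaling."""
    lr = ref_lr * batch_size / ref_bs
    iters = ref_iters * ref_bs // batch_size
    return lr, iters

for bs in (1024, 512, 256, 128, 64, 32, 1):
    lr, iters = scaled_schedule(bs)
    print(bs, lr, iters)
```

For example, BS=1 gives lr=0.01/256=3.90625e-05 and 81920K iterations, matching the online-training row.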

From contributors

Name	Accuracy	LogLoss	Comments
Base	0.527	2.09
Base_dereyly_lr, noBN, ReLU	0.441	2.53	max_iter=160K, stepsize=2K, gamma=0.915, but default caffenet
Base_dereyly 5x1, noBN, ReLU	0.474	2.31	5x5->1x5+5x1
Base_dereyly_PReLU	0.550	1.93	BN, PReLU + base_lr=0.035, exp lr_policy, 160K iters, 5x5->3x3+3x3
Base_dereyly 3x1	0.553	1.92	PReLU + base_lr=0.035, exp lr_policy, 160K iters, 5x5->1x3+1x3+3x1+1x3
Base_dereyly 3x1 scale aug	0.530	2.04	Same as previous, img: 128 crop from (128...300)px image, test resize to 144, crop 128
Base_dereyly 3x1 scale aug	0.512	2.17	Same as previous, img: 128 crop from (128...300)px image, test resize to (128+300)/2, crop 128
Base_dereyly 3x1->5x1	0.546	1.97*	PReLU + base_lr=0.035, exp lr_policy, 160K iters, 5x5->1x5+1x5+5x1+1x5
Base_dereyly 3x1,halfBN	0.544	1.95	PReLU + base_lr=0.035, exp lr_policy, 160K iters, 5x5->1x3+1x3+3x1+1x3, BN only for pool and fc6
Base_dereyly 5x1	0.540	2.00	PReLU + base_lr=0.035, exp lr_policy, 160K iters, 5x5->1x5+5x1
DarkNetBN	0.502	2.25	16C3->MP2->32C3->MP2->64C3->MP2->128C3->MP2->256C3->MP2->512C3->MP2->1024C3->1000CLF.BN + PReLU + base_lr=0.035, exp lr_policy, 160K iters

Residual experiments

Name	Accuracy	LogLoss	Comments
VGG-Like	0.521	2.14	1st layer = 7x7 stride 2, unlike VGG. All other layer = 1/2 VGG width
VGG-LikeRes	0.576	1.83	with residual connections, no BN
VGG-LikeResDrop	0.568	1.91	with residual connections, no BN , dropout in conv

Network width

Name	Accuracy	LogLoss	Comments
4sqrt(2)x wider	0.565	1.96	Starts overfitting
4x wider	0.563	1.92	Still no overfitting %)
2sqrt(2)x wider	0.552	1.94
2x wider	0.533	2.04
sqrt(2)x wider	0.506	2.17
Default	0.471	2.36
sqrt(2)x narrower	0.460	2.41
2x narrower	0.416	2.68
2sqrt(2)x narrower	0.340	3.11	no group conv
2sqrt(2)x narrower	0.318	3.25
4x narrower	0.256	3.33

Dataset size

Name	Accuracy	LogLoss
Default, 1.2M images	0.471	2.36
800K images	0.438	2.54
600K images	0.425	2.63
400K images	0.393	2.92
200K images	0.305	4.04

Activations

Pooling type

Pooling window/stride

CLF architecture

Conv1 parameters

Squeezing representation

Solvers

LR-policy

LR-policy-BatchNorm-Dropout = 0.2

Regularization

Dropout and width

Architectures

ResNets, good attempts

Train augmentation

Colorspace

Batch normalization

BN -- before or after ReLU?

BN and activations

BN and dropout

BN-arch-init

Batch size, ReLU

From contributors

Residual experiments

Network width

Dataset size

Dataset size, no RGB scaling

Input image size

Dataset quality

Conv1 depth

Other

Input image size

Name	Accuracy	LogLoss	Comments
64x64	0.309	3.34
96x96	0.414	2.69
128x128	0.471	2.36
180x180	0.521	2.10
224x224	0.565	1.87
300x300	0.559	2.03	In progress, results at 115K iters

Dataset quality

Name	Accuracy	LogLoss
Default, clean labels	0.471	2.36
5% incorrect labels	0.458	2.45
10% incorrect labels	0.447	2.58
15% incorrect labels	0.437	2.69
50% incorrect labels	0.347	3.44

Conv1 depth

Name	Accuracy	LogLoss	Comments
Default, no 1x1 or 3x3	0.471	2.36	conv1 -> pool1
+ 1x1x96 NiN	0.490	2.24	conv1 -> 96C1 -> pool1
+ 3x (1x1x96 NiN)	0.509	2.10	conv1 -> 3x(96C1) -> pool1
+ 5x (1x1x96 NiN)	0.514	2.11	conv1 -> 5x(96C1) -> pool1
+ 7x (1x1x96 NiN)	0.514	2.11	conv1 -> 7x(96C1) -> pool1
+ 9x (1x1x96 NiN)	0.516	2.10	conv1 -> 9x(96C1) -> pool1
+ 9x (1x1x96 NiN)R	0.509	2.13	conv1 -> Residual9x(96C1) -> pool1. 276k iters
+ 1x (3x3x96 NiN)	0.500	2.19	conv1 -> 1x(96C3) -> pool1
+ 3x (3x3x96 NiN)	0.538	1.99	conv1 -> 3x(96C3) -> pool1
+ 5x (3x3x96 NiN)	0.551	1.91	conv1 -> 5x(96C3) -> pool1