
Using the default configuration, the acc always equals 0 #74

Closed
YihangLou opened this issue Mar 3, 2018 · 7 comments

Comments

@YihangLou

YihangLou commented Mar 3, 2018

Dear author, I used the default configuration hoping to reproduce the results, but the acc always shows 0 after 2000 batches. Is there anything wrong?
I only have 2 GPUs, so I changed the visible devices to 0 and 1:
CUDA_VISIBLE_DEVICES='0,1' python -u train_softmax.py --network r100 --loss-type 4 --margin-m 0.5 --data-dir ../datasets/faces_ms1m_112x112 --prefix ../model-r100

Logfile

lyh@lyh-dell:~/workspace/insightface/src$ MXNET_ENABLE_GPU_P2P=0 CUDA_VISIBLE_DEVICES='0,1' python -u train_softmax.py --network r100 --loss-type 4 --margin-m 0.5 --data-dir /data/faces_ms1m_112x112 --prefix ../model-r100
/home/lyh/anaconda2/lib/python2.7/site-packages/urllib3/contrib/pyopenssl.py:46: DeprecationWarning: OpenSSL.rand is deprecated - you should use os.urandom instead
import OpenSSL.SSL
gpu num: 2
num_layers 100
image_size [112, 112]
num_classes 85164
Called with argument: Namespace(batch_size=120, beta=1000.0, beta_freeze=0, beta_min=5.0, c2c_mode=-10, c2c_threshold=0.0, center_alpha=0.5, center_scale=0.003, ckpt=1, coco_scale=8.676161173096705, ctx_num=2, cutoff=0, data_dir='/data/faces_ms1m_112x112', easy_margin=0, emb_size=512, end_epoch=100000, gamma=0.12, image_channel=3, image_h=112, image_w=112, images_per_identity=0, incay=0.0, loss_type=4, lr=0.1, lr_steps='', margin=4, margin_a=0.0, margin_m=0.5, margin_s=64.0, margin_verbose=0, max_steps=0, mom=0.9, network='r100', noise_sgd=0.0, num_classes=85164, num_layers=100, output_c2c=0, patch='0_0_96_112_0', per_batch_size=60, power=1.0, prefix='../model-r100', pretrained='', rand_mirror=1, rescale_threshold=0, retrain=False, scale=0.9993, target='lfw,cfp_fp,agedb_30', triplet_alpha=0.3, triplet_bag_size=3600, triplet_max_ap=0.0, use_deformable=0, use_val=False, verbose=2000, version_input=1, version_output='E', version_se=0, version_unit=3, wd=0.0005)
init resnet 100
0 1 E 3
INFO:root:loading recordio /data/faces_ms1m_112x112/train.rec...
header0 label [ 3804847. 3890011.]
id2range 85164
0 0 3804846
c2c_stat [0, 85164]
3804846
rand_mirror 1
(120,)
loading bin 1000
...
loading bin 12000
(12000L, 3L, 112L, 112L)
ver lfw
loading bin 1000
...
loading bin 14000
(14000L, 3L, 112L, 112L)
ver cfp_fp
loading bin 1000
...
loading bin 12000
(12000L, 3L, 112L, 112L)
ver agedb_30
lr_steps [426666, 597333, 682666]
[16:29:38] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
/home/lyh/anaconda2/lib/python2.7/site-packages/mxnet/module/base_module.py:466: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (0.5 vs. 0.00833333333333). Is this intended?
optimizer_params=optimizer_params)
call reset()
INFO:root:Epoch[0] Batch [20] Speed: 211.74 samples/sec acc=0.000000
INFO:root:Epoch[0] Batch [40] Speed: 204.20 samples/sec acc=0.000000
INFO:root:Epoch[0] Batch [60] Speed: 200.59 samples/sec acc=0.000000
...
INFO:root:Epoch[0] Batch [980] Speed: 196.32 samples/sec acc=0.000000
lr-batch-epoch: 0.1 999 0
INFO:root:Epoch[0] Batch [1000] Speed: 197.65 samples/sec acc=0.000000
...
INFO:root:Epoch[0] Batch [1820] Speed: 199.10 samples/sec acc=0.000000
INFO:root:Epoch[0] Batch [1840] Speed: 199.07 samples/sec acc=0.000000

@mike07026

Referring to the model (log) the author published: trained with batch size 512, this algorithm should reach acc > 0 after about 2k batches. But your batch size is 120, which is much smaller than 512, so I guess the algorithm hasn't seen enough samples yet (2000×120 vs 2000×512).
I suggest you wait at least 512/120 × 2k ≈ 8.5k batches.
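
For reference, the scaling works out like this (a back-of-the-envelope sketch, not from the repo; the 512/2k figures come from the published log and 120 from the Namespace printed above):

```python
# Back-of-the-envelope: to see as many training samples as the
# reference run, scale the batch count by the batch-size ratio.
ref_batch_size = 512   # batch size of the published reference log
ref_batches = 2000     # batches until acc > 0 in that log
my_batch_size = 120    # per_batch_size 60 x 2 GPUs (from the log above)

wait_batches = ref_batch_size * ref_batches / my_batch_size
print(round(wait_batches))  # ~8533 batches before acc should start moving
```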

@nttstar
Collaborator

nttstar commented Mar 5, 2018

@mike07026 is right. Also, try a smaller network and a larger batch size to obtain a stable result.
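
For example, something like the following (an illustrative command, assuming the r50 backbone is available; adjust --per-batch-size to whatever your GPUs can hold):

CUDA_VISIBLE_DEVICES='0,1' python -u train_softmax.py --network r50 --loss-type 4 --margin-m 0.5 --data-dir ../datasets/faces_ms1m_112x112 --prefix ../model-r50 --per-batch-size 128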

@nttstar nttstar closed this as completed Mar 5, 2018
@eladrich

eladrich commented Mar 5, 2018

Hi,
I encountered the same problem when training with the default parameters but a batch size of 32.

I understand the results should differ from the original, but the model has been training for over an epoch already without any increase in accuracy. Wouldn't you expect at least some improvement?

@nttstar
Collaborator

nttstar commented Mar 5, 2018

@eladrich A batch size of 32 is too small. I suggest using at least 128. My experiments use 512 (128 × 4 GPUs).
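
For context, the global batch size in this script is the per-GPU batch times the GPU count, which is why the log above reports batch_size=120 (a minimal check, with values read off the printed Namespace):

```python
# Global batch size = per-GPU batch size x number of GPUs (ctx_num).
per_batch_size = 60   # --per-batch-size value used in the log above
ctx_num = 2           # CUDA_VISIBLE_DEVICES='0,1'
print(per_batch_size * ctx_num)  # 120, matching batch_size=120 in the log
```

So 512 corresponds to --per-batch-size 128 on four GPUs.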

@zhangjiekui

That's right!

CUDA_VISIBLE_DEVICES='0,1' python -u train_softmax.py --network r100 --loss-type 4 --margin-m 0.5 --data-dir ../datasets/faces_ms1m_112x112 --prefix ../model-r100 --per-batch-size 32

![acc log](https://user-images.githubusercontent.com/33198334/37393912-740b3d52-27ad-11e8-8152-4b18533ce020.jpg)

@xxllp

xxllp commented Apr 18, 2018

@zhangjiekui, how long did the training take?

@yxchng

yxchng commented Mar 1, 2019

@nttstar Is it not possible to train with a small batch size of 32 or 64 (data=MS1M-ArcFace, m=0.5)? Or is it just going to take longer? Do you have any idea why SphereFace loss and CosFace loss converge faster (they don't stay at 0 for very long) even with a small batch size?
