
Using the default configuration, the acc always equals 0 #74

Closed
YihangLou opened this issue Mar 3, 2018 · 7 comments

Comments

@YihangLou

YihangLou commented Mar 3, 2018

Dear author, I used the default configuration hoping to reproduce the results, but the acc always shows 0 after 2000 batches. Is there anything wrong?
I only have 2 GPUs, so I changed the visible devices to 0 and 1:
CUDA_VISIBLE_DEVICES='0,1' python -u train_softmax.py --network r100 --loss-type 4 --margin-m 0.5 --data-dir ../datasets/faces_ms1m_112x112 --prefix ../model-r100

Logfile

lyh@lyh-dell:~/workspace/insightface/src$ MXNET_ENABLE_GPU_P2P=0 CUDA_VISIBLE_DEVICES='0,1' python -u train_softmax.py --network r100 --loss-type 4 --margin-m 0.5 --data-dir /data/faces_ms1m_112x112 --prefix ../model-r100
/home/lyh/anaconda2/lib/python2.7/site-packages/urllib3/contrib/pyopenssl.py:46: DeprecationWarning: OpenSSL.rand is deprecated - you should use os.urandom instead
import OpenSSL.SSL
gpu num: 2
num_layers 100
image_size [112, 112]
num_classes 85164
Called with argument: Namespace(batch_size=120, beta=1000.0, beta_freeze=0, beta_min=5.0, c2c_mode=-10, c2c_threshold=0.0, center_alpha=0.5, center_scale=0.003, ckpt=1, coco_scale=8.676161173096705, ctx_num=2, cutoff=0, data_dir='/data/faces_ms1m_112x112', easy_margin=0, emb_size=512, end_epoch=100000, gamma=0.12, image_channel=3, image_h=112, image_w=112, images_per_identity=0, incay=0.0, loss_type=4, lr=0.1, lr_steps='', margin=4, margin_a=0.0, margin_m=0.5, margin_s=64.0, margin_verbose=0, max_steps=0, mom=0.9, network='r100', noise_sgd=0.0, num_classes=85164, num_layers=100, output_c2c=0, patch='0_0_96_112_0', per_batch_size=60, power=1.0, prefix='../model-r100', pretrained='', rand_mirror=1, rescale_threshold=0, retrain=False, scale=0.9993, target='lfw,cfp_fp,agedb_30', triplet_alpha=0.3, triplet_bag_size=3600, triplet_max_ap=0.0, use_deformable=0, use_val=False, verbose=2000, version_input=1, version_output='E', version_se=0, version_unit=3, wd=0.0005)
init resnet 100
0 1 E 3
INFO:root:loading recordio /data/faces_ms1m_112x112/train.rec...
header0 label [ 3804847. 3890011.]
id2range 85164
0 0 3804846
c2c_stat [0, 85164]
3804846
rand_mirror 1
(120,)
loading bin 1000
...
loading bin 12000
(12000L, 3L, 112L, 112L)
ver lfw
loading bin 1000
...
loading bin 14000
(14000L, 3L, 112L, 112L)
ver cfp_fp
loading bin 1000
...
loading bin 12000
(12000L, 3L, 112L, 112L)
ver agedb_30
lr_steps [426666, 597333, 682666]
[16:29:38] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
/home/lyh/anaconda2/lib/python2.7/site-packages/mxnet/module/base_module.py:466: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (0.5 vs. 0.00833333333333). Is this intended?
optimizer_params=optimizer_params)
call reset()
INFO:root:Epoch[0] Batch [20] Speed: 211.74 samples/sec acc=0.000000
INFO:root:Epoch[0] Batch [40] Speed: 204.20 samples/sec acc=0.000000
INFO:root:Epoch[0] Batch [60] Speed: 200.59 samples/sec acc=0.000000
...
INFO:root:Epoch[0] Batch [980] Speed: 196.32 samples/sec acc=0.000000
lr-batch-epoch: 0.1 999 0
INFO:root:Epoch[0] Batch [1000] Speed: 197.65 samples/sec acc=0.000000
...
INFO:root:Epoch[0] Batch [1820] Speed: 199.10 samples/sec acc=0.000000
INFO:root:Epoch[0] Batch [1840] Speed: 199.07 samples/sec acc=0.000000

@mike07026

Referring to the model (log) the author published: trained with batch size 512, this algorithm should reach acc > 0 after about 2k batches. But your batch size is 120, which is much smaller than 512, so I guess the algorithm hasn't seen enough samples yet (2000×120 vs 2000×512).
I suggest you wait at least 512/120 × 2k ≈ 8.5k batches.
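
For reference, the scaling works out like this (a back-of-the-envelope sketch, not from the repo; the 512/2k figures come from the published log and 120 from the Namespace printed above):

```python
# Back-of-the-envelope: to see as many training samples as the
# reference run, scale the batch count by the batch-size ratio.
ref_batch_size = 512   # batch size of the published reference log
ref_batches = 2000     # batches until acc > 0 in that log
my_batch_size = 120    # per_batch_size 60 x 2 GPUs (from the log above)

wait_batches = ref_batch_size * ref_batches / my_batch_size
print(round(wait_batches))  # ~8533 batches before acc should start moving
```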

@nttstar
Collaborator

nttstar commented Mar 5, 2018

@mike07026 is right. Also, try a smaller network and a larger batch size to obtain a stable result.
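
For example, something like the following (an illustrative command, assuming the r50 backbone is available; adjust --per-batch-size to whatever your GPUs can hold):

CUDA_VISIBLE_DEVICES='0,1' python -u train_softmax.py --network r50 --loss-type 4 --margin-m 0.5 --data-dir ../datasets/faces_ms1m_112x112 --prefix ../model-r50 --per-batch-size 128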

@nttstar nttstar closed this as completed Mar 5, 2018
@eladrich

eladrich commented Mar 5, 2018

Hi,
I encountered the same problem when training with the default parameters but a batch size of 32.

I understand the results should differ from the original, but the model has been training for over an epoch already without any increase in accuracy. Wouldn't you expect at least some improvement?

@nttstar
Collaborator

nttstar commented Mar 5, 2018

@eladrich A batch size of 32 is too small. I suggest using at least 128. My experiments use 512 (128 × 4 GPUs).
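
For context, the global batch size in this script is the per-GPU batch times the GPU count, which is why the log above reports batch_size=120 (a minimal check, with values read off the printed Namespace):

```python
# Global batch size = per-GPU batch size x number of GPUs (ctx_num).
per_batch_size = 60   # --per-batch-size value used in the log above
ctx_num = 2           # CUDA_VISIBLE_DEVICES='0,1'
print(per_batch_size * ctx_num)  # 120, matching batch_size=120 in the log
```

So 512 corresponds to --per-batch-size 128 on four GPUs.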

@zhangjiekui

That's right!

CUDA_VISIBLE_DEVICES='0,1' python -u train_softmax.py --network r100 --loss-type 4 --margin-m 0.5 --data-dir ../datasets/faces_ms1m_112x112 --prefix ../model-r100 --per-batch-size 32

![acc log](https://user-images.githubusercontent.com/33198334/37393912-740b3d52-27ad-11e8-8152-4b18533ce020.jpg)

@xxllp

xxllp commented Apr 18, 2018

@zhangjiekui, how long did the training take?

@yxchng

yxchng commented Mar 1, 2019

@nttstar Is it not possible to train with a small batch size of 32 or 64 (data=MS1M-ArcFace, m=0.5)? Or is it just going to take longer? Do you have any idea why SphereFace loss and CosFace loss converge faster (they don't stay at 0 for very long) even with a small batch size?
