
Memory Errors, would appreciate more detailed requirement info in README #55

Closed · matthew-mcateer opened this issue Feb 19, 2018 · 4 comments

@matthew-mcateer

After going through the instructions for adding the dataset, installing the dependencies, and making sure I'm in the src folder of the repository, I enter the following command to train InsightFace on LResNet100E-IR (modified because my machine has only one GPU):

CUDA_VISIBLE_DEVICES='0' python -u train_softmax.py --network r100 --loss-type 4 --margin-m 0.5 --data-dir ../datasets/faces_ms1m_112x112 --prefix ../model-r100

However, I get the following output:

gpu num: 1
num_layers 100
image_size [112, 112]
num_classes 85164
Called with argument: Namespace(batch_size=128, beta=1000.0, beta_freeze=0, beta_min=5.0, c2c_mode=-10, c2c_threshold=0.0, center_alpha=0.5, center_scale=0.003, ckpt=1, coco_scale=8.676161173096705, ctx_num=1, data_dir='../datasets/faces_ms1m_112x112', easy_margin=0, emb_size=512, end_epoch=100000, gamma=0.12, image_channel=3, image_h=112, image_w=112, images_per_identity=0, incay=0.0, loss_type=4, lr=0.1, lr_steps='', margin=4, margin_m=0.5, margin_s=64.0, margin_verbose=0, max_steps=0, mom=0.9, network='r100', num_classes=85164, num_layers=100, output_c2c=0, patch='0_0_96_112_0', per_batch_size=128, power=1.0, prefix='../model-r100', pretrained='', rand_mirror=1, rescale_threshold=0, retrain=False, scale=0.9993, target='lfw,cfp_ff,cfp_fp,agedb_30', triplet_alpha=0.3, triplet_bag_size=3600, triplet_max_ap=0.0, use_deformable=0, use_val=False, verbose=2000, version_input=1, version_output='E', version_se=0, version_unit=3, wd=0.0005)
init resnet 100
0 1 E 3
INFO:root:loading recordio ../datasets/faces_ms1m_112x112/train.rec...
header0 label [ 3804847.  3890011.]
id2range 85164
0 0
3804846
rand_mirror 1
(128,)
[20:03:42] src/engine/engine.cc:55: MXNet start using engine: ThreadedEnginePerDevice
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
(12000L, 3L, 112L, 112L)
ver lfw
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
loading bin 14000
(14000L, 3L, 112L, 112L)
ver cfp_ff
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
loading bin 14000
(14000L, 3L, 112L, 112L)
ver cfp_fp
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
(12000L, 3L, 112L, 112L)
ver agedb_30
lr_steps [400000, 560000, 640000]
[20:04:18] src/operator/././cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
/usr/local/lib/python2.7/dist-packages/mxnet/module/base_module.py:466: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (1.0 vs. 0.0078125). Is this intended?
  optimizer_params=optimizer_params)
call reset()
[20:04:30] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [20:04:30] src/storage/./pooled_storage_manager.h:107: cudaMalloc failed: out of memory

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28e5ac) [0x7f60919ef5ac]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28bb72f) [0x7f609401c72f]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28bf958) [0x7f6094020958]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x525d6f) [0x7f6091c86d6f]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23ce416) [0x7f6093b2f416]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23d2600) [0x7f6093b33600]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b42cd) [0x7f6093b152cd]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8a8b) [0x7f6093b19a8b]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8c66) [0x7f6093b19c66]
[bt] (9) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b553b) [0x7f6093b1653b]

[20:04:30] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [20:04:30] src/engine/./threaded_engine.h:359: [20:04:30] src/storage/./pooled_storage_manager.h:107: cudaMalloc failed: out of memory

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28e5ac) [0x7f60919ef5ac]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28bb72f) [0x7f609401c72f]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28bf958) [0x7f6094020958]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x525d6f) [0x7f6091c86d6f]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23ce416) [0x7f6093b2f416]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23d2600) [0x7f6093b33600]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b42cd) [0x7f6093b152cd]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8a8b) [0x7f6093b19a8b]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8c66) [0x7f6093b19c66]
[bt] (9) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b553b) [0x7f6093b1653b]

A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 8 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28e5ac) [0x7f60919ef5ac]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b4574) [0x7f6093b15574]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8a8b) [0x7f6093b19a8b]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8c66) [0x7f6093b19c66]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b553b) [0x7f6093b1653b]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f60d4d20c80]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f60e2b486ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f60e287e41d]

terminate called after throwing an instance of 'dmlc::Error'
  what():  [20:04:30] src/engine/./threaded_engine.h:359: [20:04:30] src/storage/./pooled_storage_manager.h:107: cudaMalloc failed: out of memory

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28e5ac) [0x7f60919ef5ac]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28bb72f) [0x7f609401c72f]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28bf958) [0x7f6094020958]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x525d6f) [0x7f6091c86d6f]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23ce416) [0x7f6093b2f416]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23d2600) [0x7f6093b33600]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b42cd) [0x7f6093b152cd]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8a8b) [0x7f6093b19a8b]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8c66) [0x7f6093b19c66]
[bt] (9) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b553b) [0x7f6093b1653b]

A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 8 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28e5ac) [0x7f60919ef5ac]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b4574) [0x7f6093b15574]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8a8b) [0x7f6093b19a8b]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8c66) [0x7f6093b19c66]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b553b) [0x7f6093b1653b]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f60d4d20c80]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f60e2b486ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f60e287e41d]

Aborted (core dumped)

Could you include more detail in the README about the specific device, memory, and CUDA requirements?

matthew-mcateer changed the title from "Memory Errors" to "Memory Errors, would appreciate more detailed requirement info in README" on Feb 19, 2018
@nttstar (Collaborator) commented Feb 21, 2018

Refer to #32 if you do not have 24GB of memory on each GPU, or use a smaller --per-batch-size.
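
For example, the original command can be rerun with a reduced batch size by adding --per-batch-size (the value 32 below is only an illustration and may still need lowering depending on available GPU memory):

CUDA_VISIBLE_DEVICES='0' python -u train_softmax.py --network r100 --loss-type 4 --margin-m 0.5 --data-dir ../datasets/faces_ms1m_112x112 --prefix ../model-r100 --per-batch-size 32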

nttstar closed this as completed on Feb 22, 2018
@lizhenstat

Hi, I have followed the instructions to install the environment (Python 2.7, CUDA 9.0), and when running the script,

CUDA_VISIBLE_DEVICES=0 python -u train.py --network r50 --loss arcface --dataset emore --per-batch-size 8

I encountered the following problem too (I changed --per-batch-size to 8 and the network to r50, and the problem still exists):

/home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/site-packages/mxnet/numpy_op_signature.py:61: UserWarning: Some mxnet.numpy operator signatures may not be displayed consistently with their counterparts in the official NumPy package due to too-low Python version 2.7.18 |Anaconda, Inc.| (default, Jun  4 2021, 14:47:46)
[GCC 7.3.0]. Python >= 3.5 is required to make the signatures display correctly.
  .format(str(sys.version)))
gpu num: 1
prefix ./models/r50-arcface-emore/model
image_size [112, 112]
num_classes 85742
Called with argument: Namespace(batch_size=8, ckpt=3, ctx_num=1, dataset='emore', frequent=20, image_channel=3, kvstore='device', loss='arcface', lr=0.1, lr_steps='100000,160000,220000', models_root='./models', mom=0.9, network='r50', per_batch_size=8, pretrained='', pretrained_epoch=1, rescale_threshold=0, verbose=2000, wd=0.0005) {'loss_m1': 1.0, 'loss_m2': 0.5, 'loss_m3': 0.0, 'net_act': 'prelu', 'emb_size': 512, 'data_rand_mirror': True, 'num_layers': 50, 'loss_name': 'margin_softmax', 'val_targets': ['lfw', 'cfp_fp', 'agedb_30'], 'ce_loss': True, 'net_input': 1, 'image_shape': [112, 112, 3], 'net_blocks': [1, 4, 6, 2], 'fc7_lr_mult': 1.0, 'ckpt_embedding': True, 'net_unit': 3, 'net_output': 'E', 'count_flops': True, 'num_workers': 1, 'batch_size': 8, 'memonger': False, 'data_images_filter': 0, 'dataset': 'emore', 'num_classes': 85742, 'fc7_no_bias': False, 'loss': 'arcface', 'data_color': 0, 'loss_s': 64.0, 'dataset_path': '~/document/siriusShare/Clustering-Face/Data/faces_emore', 'data_cutoff': False, 'net_se': 0, 'net_multiplier': 1.0, 'fc7_wd_mult': 1.0, 'network': 'r50', 'per_batch_size': 8, 'net_name': 'fresnet', 'workspace': 256, 'max_steps': 0, 'bn_mom': 0.9}
0 1 E 3 prelu False
Network FLOPs: 12.6G
INFO:root:loading recordio ~/document/siriusShare/Clustering-Face/Data/faces_emore/train.rec...
Traceback (most recent call last):
  File "train.py", line 377, in <module>
    main()
  File "train.py", line 374, in main
    train_net(args)
  File "train.py", line 242, in train_net
    images_filter        = config.data_images_filter,
  File "/home/sirius/document/siriusShare/Clustering-Face/insightface-master/recognition/image_iter.py", line 38, in __init__
    self.imgrec = recordio.MXIndexedRecordIO(path_imgidx, path_imgrec, 'r')  # pylint: disable=redefined-variable-type
  File "/home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/site-packages/mxnet/recordio.py", line 245, in __init__
    super(MXIndexedRecordIO, self).__init__(uri, flag)
  File "/home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/site-packages/mxnet/recordio.py", line 71, in __init__
    self.open()
  File "/home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/site-packages/mxnet/recordio.py", line 248, in open
    super(MXIndexedRecordIO, self).open()
  File "/home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/site-packages/mxnet/recordio.py", line 79, in open
    check_call(_LIB.MXRecordIOReaderCreate(self.uri, ctypes.byref(self.handle)))
  File "/home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/site-packages/mxnet/base.py", line 255, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [14:54:34] src/io/local_filesys.cc:209: Check failed: allow_null:  LocalFileSystem::Open "~/document/siriusShare/Clustering-Face/Data/faces_emore/train.rec": No such file or directory
Stack trace:
  [bt] (0) /home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x40ff258) [0x7fe08a76e258]
  [bt] (1) /home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x40f6bca) [0x7fe08a765bca]
  [bt] (2) /home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/site-packages/mxnet/libmxnet.so(MXRecordIOReaderCreate+0x2d) [0x7fe089d82dad]
  [bt] (3) /home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7fe0e4ac09dd]
  [bt] (4) /home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7fe0e4ac0067]
  [bt] (5) /home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/lib-dynload/_ctypes.so(_ctypes_callproc+0x4de) [0x7fe0e456a9de]
  [bt] (6) /home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/lib-dynload/_ctypes.so(+0x9b61) [0x7fe0e4560b61]
  [bt] (7) /home/sirius/anaconda3/envs/insightFacePre/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7fe0e5f0db83]
  [bt] (8) /home/sirius/anaconda3/envs/insightFacePre/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3bb9) [0x7fe0e5fa4199]


[14:54:34] src/engine/engine.cc:55: MXNet start using engine: ThreadedEnginePerDevice

Thanks for your time; any help would be appreciated!

@nttstar (Collaborator) commented Aug 11, 2021

"~/document/siriusShare/Clustering-Face/Data/faces_emore/train.rec": No such file or directory

@lizhenstat

@nttstar Hi, thanks for your quick reply. I have successfully run the PyTorch version of the code on one GPU, following the comment at
https://github.com/deepinsight/insightface/issues/1699#issue-963165509.
