
Memory Errors, would appreciate more detailed requirement info in README #55

Closed · matthew-mcateer opened this issue Feb 19, 2018 · 4 comments

@matthew-mcateer

After going through the instructions for adding the dataset, installing the dependencies, and making sure I'm in the src folder of the repository, I enter the following command to train InsightFace on LResNet100E-IR (modified because my machine has only one GPU):

CUDA_VISIBLE_DEVICES='0' python -u train_softmax.py --network r100 --loss-type 4 --margin-m 0.5 --data-dir ../datasets/faces_ms1m_112x112 --prefix ../model-r100

However, I get the following output:

gpu num: 1
num_layers 100
image_size [112, 112]
num_classes 85164
Called with argument: Namespace(batch_size=128, beta=1000.0, beta_freeze=0, beta_min=5.0, c2c_mode=-10, c2c_threshold=0.0, center_alpha=0.5, center_scale=0.003, ckpt=1, coco_scale=8.676161173096705, ctx_num=1, data_dir='../datasets/faces_ms1m_112x112', easy_margin=0, emb_size=512, end_epoch=100000, gamma=0.12, image_channel=3, image_h=112, image_w=112, images_per_identity=0, incay=0.0, loss_type=4, lr=0.1, lr_steps='', margin=4, margin_m=0.5, margin_s=64.0, margin_verbose=0, max_steps=0, mom=0.9, network='r100', num_classes=85164, num_layers=100, output_c2c=0, patch='0_0_96_112_0', per_batch_size=128, power=1.0, prefix='../model-r100', pretrained='', rand_mirror=1, rescale_threshold=0, retrain=False, scale=0.9993, target='lfw,cfp_ff,cfp_fp,agedb_30', triplet_alpha=0.3, triplet_bag_size=3600, triplet_max_ap=0.0, use_deformable=0, use_val=False, verbose=2000, version_input=1, version_output='E', version_se=0, version_unit=3, wd=0.0005)
init resnet 100
0 1 E 3
INFO:root:loading recordio ../datasets/faces_ms1m_112x112/train.rec...
header0 label [ 3804847.  3890011.]
id2range 85164
0 0
3804846
rand_mirror 1
(128,)
[20:03:42] src/engine/engine.cc:55: MXNet start using engine: ThreadedEnginePerDevice
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
(12000L, 3L, 112L, 112L)
ver lfw
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
loading bin 14000
(14000L, 3L, 112L, 112L)
ver cfp_ff
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
loading bin 13000
loading bin 14000
(14000L, 3L, 112L, 112L)
ver cfp_fp
loading bin 1000
loading bin 2000
loading bin 3000
loading bin 4000
loading bin 5000
loading bin 6000
loading bin 7000
loading bin 8000
loading bin 9000
loading bin 10000
loading bin 11000
loading bin 12000
(12000L, 3L, 112L, 112L)
ver agedb_30
lr_steps [400000, 560000, 640000]
[20:04:18] src/operator/././cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
/usr/local/lib/python2.7/dist-packages/mxnet/module/base_module.py:466: UserWarning: Optimizer created manually outside Module but rescale_grad is not normalized to 1.0/batch_size/num_workers (1.0 vs. 0.0078125). Is this intended?
  optimizer_params=optimizer_params)
call reset()
[20:04:30] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [20:04:30] src/storage/./pooled_storage_manager.h:107: cudaMalloc failed: out of memory

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28e5ac) [0x7f60919ef5ac]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28bb72f) [0x7f609401c72f]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28bf958) [0x7f6094020958]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x525d6f) [0x7f6091c86d6f]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23ce416) [0x7f6093b2f416]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23d2600) [0x7f6093b33600]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b42cd) [0x7f6093b152cd]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8a8b) [0x7f6093b19a8b]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8c66) [0x7f6093b19c66]
[bt] (9) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b553b) [0x7f6093b1653b]

[20:04:30] /home/travis/build/dmlc/mxnet-distro/mxnet-build/dmlc-core/include/dmlc/logging.h:308: [20:04:30] src/engine/./threaded_engine.h:359: [20:04:30] src/storage/./pooled_storage_manager.h:107: cudaMalloc failed: out of memory

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28e5ac) [0x7f60919ef5ac]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28bb72f) [0x7f609401c72f]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28bf958) [0x7f6094020958]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x525d6f) [0x7f6091c86d6f]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23ce416) [0x7f6093b2f416]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23d2600) [0x7f6093b33600]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b42cd) [0x7f6093b152cd]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8a8b) [0x7f6093b19a8b]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8c66) [0x7f6093b19c66]
[bt] (9) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b553b) [0x7f6093b1653b]

A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 8 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28e5ac) [0x7f60919ef5ac]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b4574) [0x7f6093b15574]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8a8b) [0x7f6093b19a8b]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8c66) [0x7f6093b19c66]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b553b) [0x7f6093b1653b]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f60d4d20c80]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f60e2b486ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f60e287e41d]

terminate called after throwing an instance of 'dmlc::Error'
  what():  [20:04:30] src/engine/./threaded_engine.h:359: [20:04:30] src/storage/./pooled_storage_manager.h:107: cudaMalloc failed: out of memory

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28e5ac) [0x7f60919ef5ac]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28bb72f) [0x7f609401c72f]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28bf958) [0x7f6094020958]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x525d6f) [0x7f6091c86d6f]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23ce416) [0x7f6093b2f416]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23d2600) [0x7f6093b33600]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b42cd) [0x7f6093b152cd]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8a8b) [0x7f6093b19a8b]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8c66) [0x7f6093b19c66]
[bt] (9) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b553b) [0x7f6093b1653b]

A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 8 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x28e5ac) [0x7f60919ef5ac]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b4574) [0x7f6093b15574]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8a8b) [0x7f6093b19a8b]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b8c66) [0x7f6093b19c66]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x23b553b) [0x7f6093b1653b]
[bt] (5) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f60d4d20c80]
[bt] (6) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f60e2b486ba]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f60e287e41d]

Aborted (core dumped)

Could you include more detail in the README about the specific device, memory, and CUDA requirements?

matthew-mcateer changed the title from "Memory Errors" to "Memory Errors, would appreciate more detailed requirement info in README" on Feb 19, 2018
@nttstar (Collaborator) commented Feb 21, 2018

Refer to #32 if you do not have 24GB of memory on each GPU, or use a smaller --per-batch-size.
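
For example, the original command can be rerun with a reduced batch size by adding --per-batch-size (the value 32 below is only an illustration and may still need lowering depending on available GPU memory):

CUDA_VISIBLE_DEVICES='0' python -u train_softmax.py --network r100 --loss-type 4 --margin-m 0.5 --data-dir ../datasets/faces_ms1m_112x112 --prefix ../model-r100 --per-batch-size 32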

nttstar closed this as completed on Feb 22, 2018
@lizhenstat

Hi, I have followed the instructions to install the environment (Python 2.7, CUDA 9.0), and when running the script,

CUDA_VISIBLE_DEVICES=0 python -u train.py --network r50 --loss arcface --dataset emore --per-batch-size 8

I encountered the following problem too (I changed --per-batch-size to 8 and the network to r50, and the problem still exists):

/home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/site-packages/mxnet/numpy_op_signature.py:61: UserWarning: Some mxnet.numpy operator signatures may not be displayed consistently with their counterparts in the official NumPy package due to too-low Python version 2.7.18 |Anaconda, Inc.| (default, Jun  4 2021, 14:47:46)
[GCC 7.3.0]. Python >= 3.5 is required to make the signatures display correctly.
  .format(str(sys.version)))
gpu num: 1
prefix ./models/r50-arcface-emore/model
image_size [112, 112]
num_classes 85742
Called with argument: Namespace(batch_size=8, ckpt=3, ctx_num=1, dataset='emore', frequent=20, image_channel=3, kvstore='device', loss='arcface', lr=0.1, lr_steps='100000,160000,220000', models_root='./models', mom=0.9, network='r50', per_batch_size=8, pretrained='', pretrained_epoch=1, rescale_threshold=0, verbose=2000, wd=0.0005) {'loss_m1': 1.0, 'loss_m2': 0.5, 'loss_m3': 0.0, 'net_act': 'prelu', 'emb_size': 512, 'data_rand_mirror': True, 'num_layers': 50, 'loss_name': 'margin_softmax', 'val_targets': ['lfw', 'cfp_fp', 'agedb_30'], 'ce_loss': True, 'net_input': 1, 'image_shape': [112, 112, 3], 'net_blocks': [1, 4, 6, 2], 'fc7_lr_mult': 1.0, 'ckpt_embedding': True, 'net_unit': 3, 'net_output': 'E', 'count_flops': True, 'num_workers': 1, 'batch_size': 8, 'memonger': False, 'data_images_filter': 0, 'dataset': 'emore', 'num_classes': 85742, 'fc7_no_bias': False, 'loss': 'arcface', 'data_color': 0, 'loss_s': 64.0, 'dataset_path': '~/document/siriusShare/Clustering-Face/Data/faces_emore', 'data_cutoff': False, 'net_se': 0, 'net_multiplier': 1.0, 'fc7_wd_mult': 1.0, 'network': 'r50', 'per_batch_size': 8, 'net_name': 'fresnet', 'workspace': 256, 'max_steps': 0, 'bn_mom': 0.9}
0 1 E 3 prelu False
Network FLOPs: 12.6G
INFO:root:loading recordio ~/document/siriusShare/Clustering-Face/Data/faces_emore/train.rec...
Traceback (most recent call last):
  File "train.py", line 377, in <module>
    main()
  File "train.py", line 374, in main
    train_net(args)
  File "train.py", line 242, in train_net
    images_filter        = config.data_images_filter,
  File "/home/sirius/document/siriusShare/Clustering-Face/insightface-master/recognition/image_iter.py", line 38, in __init__
    self.imgrec = recordio.MXIndexedRecordIO(path_imgidx, path_imgrec, 'r')  # pylint: disable=redefined-variable-type
  File "/home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/site-packages/mxnet/recordio.py", line 245, in __init__
    super(MXIndexedRecordIO, self).__init__(uri, flag)
  File "/home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/site-packages/mxnet/recordio.py", line 71, in __init__
    self.open()
  File "/home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/site-packages/mxnet/recordio.py", line 248, in open
    super(MXIndexedRecordIO, self).open()
  File "/home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/site-packages/mxnet/recordio.py", line 79, in open
    check_call(_LIB.MXRecordIOReaderCreate(self.uri, ctypes.byref(self.handle)))
  File "/home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/site-packages/mxnet/base.py", line 255, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [14:54:34] src/io/local_filesys.cc:209: Check failed: allow_null:  LocalFileSystem::Open "~/document/siriusShare/Clustering-Face/Data/faces_emore/train.rec": No such file or directory
Stack trace:
  [bt] (0) /home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x40ff258) [0x7fe08a76e258]
  [bt] (1) /home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/site-packages/mxnet/libmxnet.so(+0x40f6bca) [0x7fe08a765bca]
  [bt] (2) /home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/site-packages/mxnet/libmxnet.so(MXRecordIOReaderCreate+0x2d) [0x7fe089d82dad]
  [bt] (3) /home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/lib-dynload/../../libffi.so.7(+0x69dd) [0x7fe0e4ac09dd]
  [bt] (4) /home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/lib-dynload/../../libffi.so.7(+0x6067) [0x7fe0e4ac0067]
  [bt] (5) /home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/lib-dynload/_ctypes.so(_ctypes_callproc+0x4de) [0x7fe0e456a9de]
  [bt] (6) /home/sirius/anaconda3/envs/insightFacePre/lib/python2.7/lib-dynload/_ctypes.so(+0x9b61) [0x7fe0e4560b61]
  [bt] (7) /home/sirius/anaconda3/envs/insightFacePre/bin/../lib/libpython2.7.so.1.0(PyObject_Call+0x43) [0x7fe0e5f0db83]
  [bt] (8) /home/sirius/anaconda3/envs/insightFacePre/bin/../lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x3bb9) [0x7fe0e5fa4199]


[14:54:34] src/engine/engine.cc:55: MXNet start using engine: ThreadedEnginePerDevice

Thanks for your time; any help would be appreciated!

@nttstar (Collaborator) commented Aug 11, 2021

"~/document/siriusShare/Clustering-Face/Data/faces_emore/train.rec": No such file or directory

@lizhenstat

@nttstar Hi, thanks for your quick reply. I have successfully run the PyTorch version of the code on one GPU, following the comment at
https://github.com/deepinsight/insightface/issues/1699#issue-963165509.
