
Can't run YOLOv3 in gpu, not enough workspace size #18524

Closed
smileyzyw opened this issue Jun 9, 2020 · 3 comments

@smileyzyw

Description

Hi guys, I recently tried to run the YOLOv3 example code provided in "07. Train YOLOv3 on PASCAL VOC" with my own dataset. The CPU version runs smoothly (with the GPU context commented out), but the GPU version fails with an error saying that the workspace size is not enough. I tried reducing the batch_size from 16 to 8, 4, 2, and 1, and the error occurs every time. In fact, according to watch -n 1 nvidia-smi, GPU memory usage stayed below 1 GB out of 12 GB during the whole run. I wondered whether my mxnet-cu80 installation was alright, so I ran the validation example a = mx.nd.ones((2, 3), mx.gpu()); it took about 10 minutes (very long) before I could type the next line b = a * 2 + 1. Still, the result shows that my code can run in the GPU context, and I don't know what is wrong with this whole situation.
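
For reference, the sanity check I ran looked roughly like this (a minimal sketch, nothing beyond the two lines mentioned above):

import mxnet as mx

# Simple check that the mxnet-cu80 build can allocate and compute on the GPU.
a = mx.nd.ones((2, 3), mx.gpu())
b = a * 2 + 1
print(b.asnumpy())  # copying back to the CPU forces the computation to finish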

Error Message

[18:29:52] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
Traceback (most recent call last):
File "train_yolo3.py", line 402, in
train(net, train_data, val_data, eval_metric, ctx, args)
File "train_yolo3.py", line 313, in train
obj_metrics.update(0, obj_losses)
File "/home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/metric.py", line 1636, in update
loss = ndarray.sum(pred).asscalar()
File "/home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 2014, in asscalar
return self.asnumpy()[0]
File "/home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 1996, in asnumpy
ctypes.c_size_t(data.size)))
File "/home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/base.py", line 253, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [18:29:52] src/operator/nn/./cudnn/cudnn_convolution-inl.h:948: Failed to find any forward convolution algorithm. with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size
Stack trace:
[bt] (0) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x4958fb) [0x7fd86c5468fb]
[bt] (1) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x319b5b7) [0x7fd86f24c5b7]
[bt] (2) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x319f335) [0x7fd86f250335]
[bt] (3) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x318d514) [0x7fd86f23e514]
[bt] (4) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x318d9ce) [0x7fd86f23e9ce]
[bt] (5) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x318e352) [0x7fd86f23f352]
[bt] (6) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x318fa43) [0x7fd86f240a43]
[bt] (7) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3195159) [0x7fd86f246159]
[bt] (8) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocatormxnet::TBlob > const&, std::vector<mxnet::OpReqType, std::allocatormxnet::OpReqType > const&, std::vector<mxnet::TBlob, std::allocatormxnet::TBlob > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocatormxnet::engine::Var* > const&, std::vector<mxnet::engine::Var*, std::allocatormxnet::engine::Var* > const&, std::vector<mxnet::Resource, std::allocatormxnet::Resource > const&, std::vector<mxnet::NDArray*, std::allocatormxnet::NDArray* > const&, std::vector<mxnet::NDArray*, std::allocatormxnet::NDArray* > const&, std::vector<unsigned int, std::allocator > const&, std::vector<mxnet::OpReqType, std::allocatormxnet::OpReqType > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x307) [0x7fd86e6fe597]
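
For what it's worth, the first log line mentions the MXNET_CUDNN_AUTOTUNE_DEFAULT variable. Setting it from Python would look roughly like the sketch below; as far as I understand, it only changes how the convolution algorithm is picked, so I am not sure it addresses the workspace failure itself:

import os

# Disable cuDNN convolution autotuning, as the first log line suggests.
# Setting it before importing mxnet is the safest place to do so.
os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0"

import mxnet as mx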

What have you tried to solve it?

  1. Changed the batch_size from 16 to 8, 4, 2, and 1.
  2. Resized the images from 608 to 416.
  3. Checked the mxnet environment and tried older versions: pip install mxnet-cu80==1.6 / 1.5 / 1.4 / 1.0.

I also looked up some similar issues with roughly the same error:
  - two copies of mxnet installed (a quick check for this case is sketched below)
  - NVIDIA driver not matching the mxnet version
  - trying an older version of mxnet
None of the solutions above solved my problem, and I have been stuck on it for two days, so please help me if you happen to have encountered the same error. THANK YOU!
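
One quick way to rule out the duplicate-install case is just to print which mxnet build Python actually imports, along these lines:

import mxnet as mx

# Shows the version and install path of the mxnet package Python actually picks up
# (useful if, e.g., both mxnet and mxnet-cu80 happen to be installed).
print(mx.__version__)
print(mx.__file__)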

Environment

Ubuntu 16.04, TITAN V, Nvidia driver 396.24.10, CUDA 8.0.61, cuDNN 6.0.21, mxnet-cu80

@smileyzyw smileyzyw added the Bug label Jun 9, 2020
@wkcn
Member

wkcn commented Jun 9, 2020

I think it may be related to the driver.
Would it be convenient to update the NVIDIA driver and CUDA to version 10.1?

@smileyzyw smileyzyw reopened this Jun 9, 2020
@smileyzyw
Author

I think it may be related to the driver.
Would it be convenient to update the NVIDIA driver and CUDA to version 10.1?

Thank you for replying. It is not easy to persuade my boss to give me that access, but it would be my final choice if nothing else can be done. I plan to update CUDA 8.0 to 9.0 first with the current driver.

@smileyzyw
Author

smileyzyw commented Jun 10, 2020

I updated my CUDA to version 9.0 and cuDNN to 7.6 (following the instructions here).

Then I ran the mxnet example code a = mx.nd.ones((2, 3), mx.gpu()), and only got a warning along the lines of:
this mxnet has been built against cuda library version 9000, which is older than the oldest version tested by CI (7600). Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
I ignored this warning and tried to run my yolo.py, which failed with another error:
Check failed: compileResult == NVRTC_SUCCESS (7 vs. 0) : NVRTC Compilation failed. Please set environment variable MXNET_USE_FUSION to 0
I typed export MXNET_USE_FUSION=0 and ran my .py again, and this time:
PROBLEM SOLVED!
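
In case it helps anyone hitting the same NVRTC error, the same workaround done from Python instead of the shell would look roughly like this (just a sketch; setting the variable before importing mxnet is the safest place, exporting it in the shell as above works too):

import os

# Disable MXNet's pointwise operator fusion, which relies on NVRTC,
# as the error message itself suggests.
os.environ["MXNET_USE_FUSION"] = "0"

import mxnet as mx
# ... the rest of train_yolo3.py runs unchanged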
