Hi guys, I recently tried to run the YOLOv3 example code provided in 07. Train YOLOv3 on PASCAL VOC with my own dataset. The CPU version runs smoothly (with the GPU context commented out), but the GPU version fails, saying that the workspace size is not enough. I tried reducing the batch_size from 16 to 8, 4, 2, and 1, and the error occurs every time. In fact, watching watch -n 1 nvidia-smi, GPU memory usage stayed under 1 GB out of 12 GB during the whole run. I wondered whether my mxnet-cu80 installation was alright, so I ran the validation example code a = mx.nd.ones((2, 3), mx.gpu()); it took about 10 minutes (very long) before I could input the next line b = a * 2 + 1. Anyway, the result showed that my code can run in a GPU context, and I don't know what is wrong with this whole situation.
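(For reference, a minimal sketch of that GPU smoke test with explicit synchronization; MXNet executes asynchronously, so without it the one-time cost of CUDA context setup is silently folded into whatever operation you wait on first. The timing code is added here and was not part of the original report.)

```python
import time
import mxnet as mx

# Time the first GPU op separately: it pays the one-time cost of
# creating the CUDA context and loading cuDNN.
start = time.time()
a = mx.nd.ones((2, 3), mx.gpu())
a.wait_to_read()  # block until the async engine has actually computed `a`
print("first GPU op: %.1f s" % (time.time() - start))

b = a * 2 + 1
print(b.asnumpy())  # copies back to CPU; should print a 2x3 array of 3.0s
```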
Error Message
[18:29:52] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
Traceback (most recent call last):
File "train_yolo3.py", line 402, in
train(net, train_data, val_data, eval_metric, ctx, args)
File "train_yolo3.py", line 313, in train
obj_metrics.update(0, obj_losses)
File "/home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/metric.py", line 1636, in update
loss = ndarray.sum(pred).asscalar()
File "/home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 2014, in asscalar
return self.asnumpy()[0]
File "/home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 1996, in asnumpy
ctypes.c_size_t(data.size)))
File "/home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/base.py", line 253, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [18:29:52] src/operator/nn/./cudnn/cudnn_convolution-inl.h:948: Failed to find any forward convolution algorithm. with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size
Stack trace:
[bt] (0) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x4958fb) [0x7fd86c5468fb]
[bt] (1) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x319b5b7) [0x7fd86f24c5b7]
[bt] (2) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x319f335) [0x7fd86f250335]
[bt] (3) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x318d514) [0x7fd86f23e514]
[bt] (4) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x318d9ce) [0x7fd86f23e9ce]
[bt] (5) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x318e352) [0x7fd86f23f352]
[bt] (6) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x318fa43) [0x7fd86f240a43]
[bt] (7) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3195159) [0x7fd86f246159]
[bt] (8) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x307) [0x7fd86e6fe597]
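For context, both knobs this log names are real: cuDNN autotuning is controlled by the MXNET_CUDNN_AUTOTUNE_DEFAULT environment variable, and 1073741824 bytes is MXNet's default 1024 MB per-convolution workspace limit. A sketch of launching the script with autotuning disabled (the variable must be set before mxnet is imported; the flags shown are illustrative, not confirmed from the original report):

```bash
# Skip the cuDNN algorithm search that is failing above.
export MXNET_CUDNN_AUTOTUNE_DEFAULT=0
python train_yolo3.py --batch-size 8 --gpus 0
```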
What have you tried to solve it?
- changed the batch_size from 16 to 8, 4, 2, 1
- resized the images from 608 to 416
- checked the mxnet environment and tried older versions: pip install mxnet-cu80==1.6, 1.5, 1.4, 1.0
- looked up similar issues with approximately the same errors: two mxnet builds installed at once, Nvidia driver not matching the mxnet version, trying an old version of MXNet
None of the solutions above solved my problem, and I have been stuck on it for two days, so please help me if you happen to have encountered the same error. THANK YOU!
Environment
Ubuntu 16.04, TITAN V, Nvidia driver 396.24.10, CUDA 8.0.61, cuDNN 6.0.21, mxnet-cu80
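(A quick, generic way to check that the installed wheel, the CUDA toolkit, and the driver actually agree with the environment listed above; a sketch assuming nvcc is on the PATH:)

```bash
nvidia-smi | head -n 4                              # driver version
nvcc --version | tail -n 1                          # CUDA toolkit version
python -c "import mxnet; print(mxnet.__version__)"  # installed MXNet
pip show mxnet-cu80 | grep -i "^version"            # wheel version
```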
I think it may be related to the driver.
Would it be convenient to update the NVIDIA driver and CUDA to version 10.1?
Thank you for replying. It is not easy to persuade my boss to give me that access, so it would be my final choice if nothing more can be done. I plan to update CUDA 8.0 to 9.0 first with the current driver.
I updated my CUDA to version 9.0 and cuDNN to 7.6 (following the instructions here).
Then I ran the mxnet example code a = mx.nd.ones((2, 3), mx.gpu()) and got only a warning like this: mxnet has been built against cuda library version 9000, which is older than the oldest version tested by CI (7600). Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
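(At this point it can help to confirm what the new wheel was actually built with; a sketch assuming MXNet >= 1.6, where mxnet.runtime.Features() is available, using the MXNET_CUDNN_LIB_CHECKING switch that the warning itself names:)

```python
import os
# Quiet the library-version warning, as the message itself suggests.
# Environment variables must be set before mxnet is imported.
os.environ['MXNET_CUDNN_LIB_CHECKING'] = '0'

import mxnet as mx
from mxnet.runtime import Features

print('mxnet', mx.__version__)
f = Features()
print('CUDA:', f.is_enabled('CUDA'), '| CUDNN:', f.is_enabled('CUDNN'))
```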
I ignored this warning and tried to run my yolo.py, and got another error: Check failed: compileResult == NVRTC_SUCCESS (7 vs. 0) : NVRTC Compilation failed. Please set environment variable MXNET_USE_FUSION to 0
I typed export MXNET_USE_FUSION=0 and ran my .py again, and it turned out:
PROBLEM SOLVED!
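(For anyone landing on the same NVRTC error, the working sequence was just what the error message suggests; bash shown below, and setting os.environ['MXNET_USE_FUSION'] = '0' at the top of the script, before import mxnet, should be equivalent:)

```bash
# Disable MXNet's NVRTC-based pointwise operator fusion.
export MXNET_USE_FUSION=0
python yolo.py
```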