training model failed after one epoch on GPU: e == CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED #14799
Comments
@khui Not sure what's going on, but I can see that you have enabled:
Thanks @lanking520 for the answer! I am using Docker, and the mxnet being used is mxnet-cu92mkl. Do you mean I should instead use pip install mxnet_p36? The conda env from which the docker container is launched is:
In addition, I tried using the naive engine by setting:
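For reference, a minimal sketch of how the naive engine is typically enabled (this is an assumption about which setting was meant above; the reporter's exact value is not shown in the thread):

```python
import os

# MXNet picks its execution engine from the MXNET_ENGINE_TYPE environment
# variable; it has to be set before mxnet is imported to take effect.
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"

import mxnet as mx  # imported only after the engine type is set
print(mx.__version__)
```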
As a note, when reproducing the errors in a Jupyter notebook, I got the following errors when trying to print out the loss and compute its mean (after getting the errors described earlier). @lanking520 Could you help check the following error messages? Please let me know if they give you any hints. Thanks!! The loss is:
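As a side note (an addition, not from the thread): MXNet executes operators asynchronously, so a GPU-side failure often only surfaces when a result is synchronized, e.g. when the loss is printed or reduced. Forcing a synchronization point after each step can reveal which batch actually fails. A minimal sketch, assuming a CUDA-enabled build and one GPU:

```python
import mxnet as mx

ctx = mx.gpu(0)  # assumes a CUDA-enabled mxnet build and at least one GPU
x = mx.nd.random.uniform(shape=(8, 16), ctx=ctx)
w = mx.nd.random.uniform(shape=(16, 1), ctx=ctx)
w.attach_grad()

with mx.autograd.record():
    loss = mx.nd.dot(x, w).mean()
loss.backward()

# waitall() blocks until all queued GPU work has finished, so any pending
# cuDNN error is raised here rather than at a later print()/.asscalar() call.
mx.nd.waitall()
print(loss.asscalar())
```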
@khui If you run the code and get the error inside the Docker container, the issue is very likely related to the container only.
@mirocody Thanks! The errors appear when I am using DLAMI. To debug, I ran a Docker container to rule out the possibility that the bugs come from a mismatched mxnet/cuda/cudnn version. Since that seems unlikely after trying different combinations, I switched back to DLAMI. The container is run using the following command, after which some commands are run inside the container as usual. The Jupyter notebook is run on mxnet_p36 per the suggestion from @lanking520.
@khui In DLAMI we support mxnet with cu90, so I am not sure the error you got is related to cu92 and cudnn. I would suggest trying the latest DLAMI and running your code in the conda env; if it still does not work, you can cut a ticket to us, while the community can still look at this issue to see whether it is related to the mxnet framework.
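For context (not part of the original comment), a quick way to confirm which MXNet build the active conda environment provides and whether a GPU context is usable at all; a minimal sketch assuming a single GPU:

```python
import mxnet as mx

# Which build is installed, and how many GPUs the runtime can see.
print("mxnet version:", mx.__version__)
print("GPUs visible: ", mx.context.num_gpus())

# Allocating a small array on the GPU fails immediately if the CUDA/cuDNN
# libraries on the machine do not match the ones the pip package expects.
a = mx.nd.ones((2, 2), ctx=mx.gpu(0))
print(a.asnumpy())
```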
After setting the
I got a similar error.
@yuzhoujianxia thanks for reporting. Would you mind filing a new bug report for your case?
A smaller batch_size saved me.
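For context (an addition, not from the thread): CUDNN_STATUS_EXECUTION_FAILED frequently shows up when a kernel runs out of GPU memory or cuDNN workspace, so lowering the batch size reduces the per-batch footprint. A minimal Gluon sketch with hypothetical data and sizes:

```python
import mxnet as mx
from mxnet import gluon

# Dummy data standing in for the real training set (shapes are hypothetical).
features = mx.nd.random.uniform(shape=(256, 10))
labels = mx.nd.zeros((256,))
dataset = gluon.data.ArrayDataset(features, labels)

# Halving batch_size (e.g. 64 -> 32) lowers the per-batch GPU memory and
# cuDNN workspace requirements, a common remedy for this error.
train_loader = gluon.data.DataLoader(dataset, batch_size=32, shuffle=True)

for data, label in train_loader:
    pass  # training step would go here
```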
Description
When using the GPU, the model trains for roughly one epoch and then fails with Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED. The error appears repeatedly when training on the GPU, but at different points (epoch/batch), and sometimes from different lines. The same model trains successfully on the CPU. Any ideas about the possible reasons?
I also posted on the forum. I am not sure whether this is due to my misuse of mxnet or to an issue in mxnet itself. Sorry for the duplicate posts.
Environment info (Required)
What have you tried to solve it?