Conv1D throws CUDNN_STATUS_EXECUTION_FAILED #11241

eric-haibin-lin · 2018-06-12T06:32:41Z

Setup:

$ pip install mxnet-cu90==1.1.0

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2017 NVIDIA Corporation
Built on Fri_Sep__1_21:08:03_CDT_2017
Cuda compilation tools, release 9.0, V9.0.176

$ ls /usr/local/cuda/lib64/libcudnn.so.7.0.3
/usr/local/cuda/lib64/libcudnn.so.7.0.3

Run the following script debug.py:

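import mxnet as mx

# 'add' triggers the failure; 'write' does not.
W_REQ = 'add'
# NCW input: batch 1, 65536 channels, width 1. The weight has the same
# shape here: (num_filter, channels, kernel) = (1, 65536, 1).
shape = (1, 65536, 1)
ctx = mx.gpu()
kwargs = {'no_bias': True, 'kernel': (1,), 'num_filter': 1}
x = mx.sym.var('x')
w = mx.sym.var('w')
x_grad = mx.nd.zeros(shape, ctx=ctx)
w_grad = mx.nd.zeros(shape, ctx=ctx)
args_grad = {'x': x_grad, 'w': w_grad}
sym = mx.sym.Convolution(x, w, **kwargs)
executor = sym.bind(ctx, grad_req={'x': 'null', 'w': W_REQ},
                    args={'x': mx.nd.ones(shape, ctx=ctx), 'w': mx.nd.ones(shape, ctx=ctx)},
                    args_grad=args_grad)
executor.forward()
# The output has shape (1, 1, 1), so the head gradient does too.
executor.backward([mx.nd.ones((1,1,1), ctx=ctx)])
# Block until all asynchronous work finishes so the error surfaces here.
mx.nd.waitall()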

Gives the following error:

[06:31:41] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
terminate called after throwing an instance of 'dmlc::Error'
  what():  [06:31:41] src/engine/./threaded_engine.h:359: [06:31:41] src/operator/nn/./cudnn/cudnn_convolution-inl.h:242: Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED

Stack trace returned 10 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x2a9e78) [0x7f705d9a3e78]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x2aa288) [0x7f705d9a4288]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x2a920d1) [0x7f706018c0d1]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x262f5e7) [0x7f705fd295e7]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x24570bb) [0x7f705fb510bb]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x245d7d4) [0x7f705fb577d4]
[bt] (6) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x243e2ed) [0x7f705fb382ed]
[bt] (7) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x2442bdb) [0x7f705fb3cbdb]
[bt] (8) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x2442db6) [0x7f705fb3cdb6]
[bt] (9) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x243f68b) [0x7f705fb3968b]


A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 9 entries:
[bt] (0) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x2a9e78) [0x7f705d9a3e78]
[bt] (1) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x2aa288) [0x7f705d9a4288]
[bt] (2) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x243e594) [0x7f705fb38594]
[bt] (3) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x2442bdb) [0x7f705fb3cbdb]
[bt] (4) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x2442db6) [0x7f705fb3cdb6]
[bt] (5) /usr/local/lib/python2.7/dist-packages/mxnet/libmxnet.so(+0x243f68b) [0x7f705fb3968b]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f70d400bc80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f70d52a66ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f70d4fdc3dd]

Note that there's no error if W_REQ is changed to 'write'.
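
For context, grad_req controls how backward() stores a gradient: 'write' overwrites the gradient buffer, 'add' accumulates into it, and 'null' skips the gradient. Below is a minimal pure-Python sketch of that difference. apply_grad is a hypothetical helper used only for illustration; it is not an MXNet API and does not touch cuDNN.

```python
def apply_grad(buffer, new_grad, grad_req):
    """Update a gradient buffer in place according to grad_req.

    Hypothetical helper: mimics the documented meaning of the three
    grad_req modes on plain Python lists.
    """
    if grad_req == 'null':
        # Gradient is not computed or stored.
        return buffer
    if grad_req == 'write':
        # Overwrite whatever was in the buffer.
        buffer[:] = new_grad
    elif grad_req == 'add':
        # Accumulate into the existing contents.
        for i, g in enumerate(new_grad):
            buffer[i] += g
    else:
        raise ValueError('unknown grad_req: %r' % (grad_req,))
    return buffer


w_grad_write = [0.0, 0.0]
w_grad_add = [0.0, 0.0]
# Two backward passes that each produce the same gradient.
for _ in range(2):
    apply_grad(w_grad_write, [1.0, 2.0], 'write')
    apply_grad(w_grad_add, [1.0, 2.0], 'add')

print(w_grad_write)  # [1.0, 2.0]
print(w_grad_add)    # [2.0, 4.0]
```

Because 'add' has to combine the new result with the existing buffer, the backend may take a different code path than it does for 'write', which is consistent with only one of the two modes failing here.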

This can also be reproduced when building mxnet from source at commit 5b99b25, where cuDNN support for Conv1D was first introduced.

eric-haibin-lin · 2018-06-12T06:51:41Z

This can also be reproduced with the following code, debug_gluon.py:

import mxnet as mx
from mxnet import autograd
from mxnet.gluon import nn

if __name__ == '__main__':
    ctx = mx.gpu()
    # NCW input: batch 1, 65536 channels, width 1560.
    x = mx.nd.ones((1, 65536, 1560), ctx=ctx)
    net = nn.Conv1D(channels=256, kernel_size=2, layout='NCW', use_bias=False)
    net.initialize(ctx=ctx)

    # Accumulate gradients instead of overwriting them; this is what triggers the error.
    for p in net.collect_params().values():
        p.grad_req = 'add'

    with autograd.record():
        y = net(x)
    y.backward()
    print(net.weight.grad())

This was run with mxnet installed via pip install mxnet-cu90 --pre.

DickJC123 · 2018-06-12T17:32:37Z

What GPU are you trying to run on? What were the nvcc args used to build your libmxnet.so?

eric-haibin-lin · 2018-06-12T18:45:49Z

Tesla V100.

git checkout 5b99b25e5f6ab3a20c7bcf4821a6af0a1a95f823
git submodule update --init --recursive 
cp make/config.mk .
echo "USE_BLAS=openblas" >>config.mk
echo "ADD_CFLAGS += -I/usr/include/openblas" >>config.mk
echo "USE_CUDA=1" >>config.mk
echo "USE_CUDA_PATH=/usr/local/cuda" >>config.mk
echo "USE_CUDNN=1" >>config.mk
make -j32

Run python debug.py
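
The error output above suggests rerunning with MXNET_ENGINE_TYPE set to NaiveEngine so that operations run synchronously and the backtrace points at the failing call. Below is a minimal sketch of launching the script that way from Python. debug_env and run_script are hypothetical helper names, and debug.py is assumed to be in the current directory.

```python
import os
import subprocess
import sys


def debug_env(base_env=None):
    """Return a copy of the environment with the synchronous engine selected."""
    env = dict(os.environ if base_env is None else base_env)
    # Variable name and value are taken from the MXNet error message.
    env['MXNET_ENGINE_TYPE'] = 'NaiveEngine'
    return env


def run_script(script='debug.py'):
    """Run the repro script in a child process and return its exit code."""
    return subprocess.call([sys.executable, script], env=debug_env())

# Usage (not executed here): exit_code = run_script('debug.py')
```

Setting the variable only for the child process leaves the parent shell untouched, which matches the error message's advice to reset MXNET_ENGINE_TYPE after debugging.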

eric-haibin-lin · 2018-06-18T18:49:49Z

Update: the cuDNN team has been notified that cudnnFind() is returning algorithms that fail when executed.
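
Since the failing algorithm comes from the autotuning step, one thing to try is turning autotuning off with the MXNET_CUDNN_AUTOTUNE_DEFAULT variable named in the log message. This is a sketch only: nobody in this thread confirmed that it avoids the crash, and disable_cudnn_autotune is a hypothetical helper name.

```python
import os


def disable_cudnn_autotune(environ=None):
    """Set the variable that the MXNet log message says disables autotuning."""
    target = os.environ if environ is None else environ
    target['MXNET_CUDNN_AUTOTUNE_DEFAULT'] = '0'
    return target


# Set the variable before importing mxnet, to be safe.
disable_cudnn_autotune()
# import mxnet as mx  # then run the repro as usual
```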

[MXNET-11241] Avoid use of troublesome cudnnFind() results when grad_req='add' (apache#11338)
* Add tests that fail due to issue 11241
* Fix apache#11241 Conv1D throws CUDNN_STATUS_EXECUTION_FAILED
* Force algo 1 when grad_req==add with large c. Expand tests.
* Shorten test runtimes.
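
The commit message describes the fix as forcing algo 1 when grad_req is 'add' and the channel count c is large. Below is a rough pure-Python sketch of that selection rule. This is not the actual code from the PR: pick_algo is a hypothetical name, and the threshold is a placeholder borrowed from the repro script, since the commit message does not say what counts as "large".

```python
# Placeholder only: the commit message does not give the real cutoff.
LARGE_C_THRESHOLD = 65536
# "algo 1" as named in the commit message.
FORCED_ALGO = 1


def pick_algo(found_algo, grad_req, channels, threshold=LARGE_C_THRESHOLD):
    """Return the algorithm to use for a convolution.

    found_algo stands in for whatever the autotuning search returned.
    The search result is overridden only in the problematic case.
    """
    if grad_req == 'add' and channels >= threshold:
        return FORCED_ALGO
    return found_algo


print(pick_algo(3, 'add', 65536))    # 1
print(pick_algo(3, 'write', 65536))  # 3
print(pick_algo(3, 'add', 16))       # 3
```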

eric-haibin-lin added Bug CUDA labels Jun 12, 2018

DickJC123 mentioned this issue Jun 19, 2018

[MXNET-11241] Avoid use of troublesome cudnnFind() results when grad_req='add' #11338 (merged)

DickJC123 added a commit to DickJC123/mxnet that referenced this issue Jun 20, 2018

Fix apache#11241 Conv1D throws CUDNN_STATUS_EXECUTION_FAILED

259bb94

DickJC123 added a commit to DickJC123/mxnet that referenced this issue Jul 16, 2018

Fix apache#11241 Conv1D throws CUDNN_STATUS_EXECUTION_FAILED

ca60250

eric-haibin-lin closed this as completed in 024b5a9 Jul 30, 2018
