Flaky test: test_operator_gpu.test_sequence_last causes 'CUDA: unspecified launch failure' #11395

marcoabreu · 2018-06-25T21:02:17Z

Sometimes, our slaves get corrupted and suddenly all test start to fail. This is unrelated to the tests directly.

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11377/5/pipeline/

======================================================================

ERROR: test_operator_gpu.test_op_roi_align

----------------------------------------------------------------------

Traceback (most recent call last):

  File "C:\Anaconda3\envs\py2\lib\site-packages\nose\case.py", line 197, in runTest

    self.test(*self.arg)

  File "C:\Anaconda3\envs\py2\lib\site-packages\nose\util.py", line 620, in newfunc

    return func(*arg, **kw)

  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\gpu\../unittest\common.py", line 157, in test_new

    orig_test(*args, **kwargs)

  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\gpu\../unittest\test_operator.py", line 6269, in test_op_roi_align

    test_roi_align_value()

  File "C:\jenkins_slave\workspace\ut-python-gpu\tests\python\gpu\../unittest\test_operator.py", line 6230, in test_roi_align_value

    data = mx.nd.array(np.arange(N*C*W*H).reshape((N,C,H,W)), ctx=ctx, dtype = dtype)

  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\utils.py", line 146, in array

    return _array(source_array, ctx=ctx, dtype=dtype)

  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\ndarray.py", line 2357, in array

    arr[:] = source_array

  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\ndarray.py", line 444, in __setitem__

    self._set_nd_basic_indexing(key, value)

  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\ndarray.py", line 710, in _set_nd_basic_indexing

    self._sync_copyfrom(value)

  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\ndarray\ndarray.py", line 876, in _sync_copyfrom

    ctypes.c_size_t(source_array.size)))

  File "C:\jenkins_slave\workspace\ut-python-gpu\pkg_vc14_gpu\python\mxnet\base.py", line 210, in check_call

    raise MXNetError(py_str(_LIB.MXGetLastError()))

MXNetError: [06:35:08] c:\jenkins_slave\workspace\build-gpu\3rdparty\mshadow\mshadow\./tensor_gpu-inl.h:69: Check failed: e == cudaSuccess CUDA: unspecified launch failure

-------------------- >> begin captured logging << --------------------

common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1046236735 to reproduce.

--------------------- >> end captured logging << ---------------------

The text was updated successfully, but these errors were encountered:

marcoabreu · 2018-06-25T21:02:59Z

@DickJC123

DickJC123 · 2018-06-25T21:28:04Z

In a private communication, you indicated this was seen on all platforms. Here you tag it as 'Windows'. Please clarify.

marcoabreu · 2018-06-26T11:06:18Z

Sorry Dick, I just double checked my database and it seems to only happen on Windows. It seems like I mixed something up, please excuse me for that.

Config: Windows Server 2016, G3.8xlarge, CUDA8, unknown driver version

anirudhacharya · 2018-06-26T19:18:09Z

Another Jenkins log for flaky failure of test_operator_gpu.test_op_roi_align - http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/PR-10889/runs/17/nodes/751/log/?start=0

marcoabreu · 2018-06-27T23:12:53Z

This seems to be caused by test_conv:
http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/master/runs/1050/nodes/752/steps/1370/log/?start=0

test_operator_gpu.test_ndarray_crop ... ok
test_operator_gpu.test_cell_fill_shape ... ok
test_operator_gpu.test_conv ... [00:06:32] c:\jenkins_slave\workspace\build-gpu\src\operator\nn\cudnn\./cudnn_algoreg-inl.h:107: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[00:06:32] C:/jenkins_slave/workspace/build-gpu/src/operator/nn/convolution.cu:148: This convolution is not supported by cudnn, MXNET convolution is applied.
[00:06:32] C:/jenkins_slave/workspace/build-gpu/src/operator/nn/convolution.cu:227: This convolution is not supported by cudnn, MXNET convolution is applied.
[00:06:32] C:/jenkins_slave/workspace/build-gpu/src/operator/nn/convolution.cu:148: This convolution is not supported by cudnn, MXNET convolution is applied.
[00:06:32] C:/jenkins_slave/workspace/build-gpu/src/operator/nn/convolution.cu:227: This convolution is not supported by cudnn, MXNET convolution is applied.
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=448545343 to reproduce.
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1330733257 to reproduce.
ERROR
test_operator_gpu.test_layer_fill_shape ... ERROR
test_operator_gpu.test_normal_generator ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1592512793 to reproduce.
ERROR
test_operator_gpu.test_ndarray_concatenate ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=466881030 to reproduce.
ERROR
test_operator_gpu.test_sparse_nd_transpose ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=221598447 to reproduce.
ERROR
test_operator_gpu.test_sparse_nd_storage_fallback ... [00:06:32] c:\jenkins_slave\workspace\build-gpu\src\operator\../common/utils.h:417: 
Storage type fallback detected:
operator = broadcast_add
input storage types = [default, default, ]
output storage types = [csr, ]
params = {}
context.dev_mask = gpu
The operator with default storage type will be dispatched for execution. You're seeing this warning message because the operator above is unable to process the given ndarrays with specified storage types, context and parameter. Temporary dense ndarrays are generated in order to execute the operator. This does not affect the correctness of the programme. You can set environment variable MXNET_STORAGE_FALLBACK_LOG_VERBOSE to 0 to suppress this warning.
[INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1026960206 to reproduce.
ERROR
test_operator_gpu.test_clip ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1647052128 to reproduce.
ERROR
test_operator_gpu.test_convolution_with_type ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=1234 to reproduce.
ERROR

DickJC123 · 2018-06-28T19:51:19Z

I have a lead on the problem. There is an out-of-bound read performed by the SequenceLastKernel. I'll stop here and let the person responsible for this kernel correct the problem. Kernels that read beyond their valid input tensor regions can be problematic, even if the random data read is never used in a subsequent kernel write. The problem surfaces when the reads are outside of valid mapped address ranges, which results in an unservicable TLB miss. The problems can be non-deterministic since the input tensors may have non-deterministic placement within their mapped pages.

I debugged the problem by going to the first test that showed the failure in one of the above posts, captured the MXNET_TEST_SEED, and then reproduced the error (on Linux no less) with the following command:

MXNET_TEST_SEED=731510245 cuda-memcheck nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_sequence_last | c++filt
[INFO] Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1613755850 to reproduce.
[WARNING] *** test-level seed set: all "@with_seed()" tests run deterministically ***
test_operator_gpu.test_sequence_last ... [INFO] Setting test np/mx/python random seeds, use MXNET_TEST_SEED=731510245 to reproduce.
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x00000390 in void mxnet::op::mxnet_op::mxnet_generic_kernel<mxnet::op::SequenceLastKernel<1>, float*, float*, float*, int, int, mshadow::Shape<2> >(int, float*, float*, float*, int, int, mshadow::Shape<2>)
=========     by thread (2,0,0) in block (0,0,0)
=========     Address 0x7f13f24003f8 is out of bounds
=========     Device Frame:void mxnet::op::mxnet_op::mxnet_generic_kernel<mxnet::op::SequenceLastKernel<1>, float*, float*, float*, int, int, mshadow::Shape<2> >(int, float*, float*, float*, int, int, mshadow::Shape<2>) (void mxnet::op::mxnet_op::mxnet_generic_kernel<mxnet::op::SequenceLastKernel<1>, float*, float*, float*, int, int, mshadow::Shape<2> >(int, float*, float*, float*, int, int, mshadow::Shape<2>) : 0x390)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x24cc4d]
=========     Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 [0x15680]
=========     Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 (cudaLaunch + 0x14e) [0x33c9e]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::op::SequenceLastOp<mshadow::gpu, float>::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&) + 0xc3a) [0x53384da]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::op::OperatorState::Forward(mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&) + 0x363) [0x3214a53]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::exec::StatefulComputeExecutor::Run(mxnet::RunContext, bool) + 0x59) [0x3808f09]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so [0x37d5870]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) + 0x8e5) [0x372ea15]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&) + 0xeb) [0x3745b1b]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) + 0x4e) [0x3745d8e]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run() + 0x4a) [0x372e01a]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libstdc++.so.6 [0xb8c80]
=========     Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x76ba]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0x10741d]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaStreamSynchronize. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x3496d3]
=========     Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 (cudaStreamSynchronize + 0x176) [0x47336]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mshadow::Stream<mshadow::gpu>::Wait() + 0x26) [0x32635a6]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so [0x37d5945]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) + 0x8e5) [0x372ea15]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&) + 0xeb) [0x3745b1b]
ERROR

======================================================================
ERROR: test_operator_gpu.test_sequence_last
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/nose/case.py", line 197, in runTest
    self.test(*self.arg)
  File "/usr/lib/python2.7/dist-packages/nose/util.py", line 620, in newfunc
    return func(*arg, **kw)
  File "/home/dcarter/mxnet_dev/dgx/mxnet/tests/python/gpu/../unittest/common.py", line 157, in test_new
    orig_test(*args, **kwargs)
  File "/home/dcarter/mxnet_dev/dgx/mxnet/tests/python/gpu/../unittest/test_operator.py", line 2998, in test_sequence_last
    check_sequence_func("last", axis=0)
  File "/home/dcarter/mxnet_dev/dgx/mxnet/tests/python/gpu/../unittest/test_operator.py", line 2989, in check_sequence_func
    numeric_eps=1e-2, rtol=1e-2)
  File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/test_utils.py", line 906, in check_numeric_gradient
    eps=numeric_eps, use_forward_train=use_forward_train, dtype=dtype)
  File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/test_utils.py", line 781, in numeric_grad
    f_neps = executor.outputs[0].asnumpy()
  File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/ndarray/ndarray.py", line 1894, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/dcarter/mxnet_dev/dgx/mxnet/python/mxnet/base.py", line 210, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
MXNetError: [11:31:45] /home/dcarter/mxnet_dev/dgx/mxnet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f14f0b6619b]
[bt] (1) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f14f0b66d08]
[bt] (2) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mshadow::Stream<mshadow::gpu>::Wait()+0xd8) [0x7f14f31bc658]
[bt] (3) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(+0x37d5945) [0x7f14f372e945]
[bt] (4) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7f14f3687a15]
[bt] (5) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f14f369eb1b]
[bt] (6) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f14f369ed8e]
[bt] (7) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f14f368701a]
[bt] (8) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f151efb1c80]
[bt] (9) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f15269af6ba]


-------------------- >> begin captured logging << --------------------
common: INFO: Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1613755850 to reproduce.
common: WARNING: *** test-level seed set: all "@with_seed()" tests run deterministically ***
common: INFO: Setting test np/mx/python random seeds, use MXNET_TEST_SEED=731510245 to reproduce.
--------------------- >> end captured logging << ---------------------

----------------------------------------------------------------------
Ran 1 test in 22.915s

FAILED (errors=1)
terminate called after throwing an instance of 'dmlc::Error'
  what():  [11:31:45] src/storage/./pooled_storage_manager.h:77: CUDA: unspecified launch failure

Stack trace returned 10 entries:
[bt] (0) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f14f0b6619b]
[bt] (1) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f14f0b66d08]
[bt] (2) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::DirectFreeNoLock(mxnet::Storage::Handle)+0x8f) [0x7f14f36aa8cf]
[bt] (3) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::ReleaseAll()+0x95) [0x7f14f36a2ef5]
[bt] (4) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(mxnet::storage::GPUPooledStorageManager::~GPUPooledStorageManager()+0x1a) [0x7f14f36aa9ca]
[bt] (5) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::_Sp_counted_ptr<mxnet::StorageImpl*, (__gnu_cxx::_Lock_policy)2>::_M_dispose()+0xa23) [0x7f14f36a9bd3]
[bt] (6) /home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so(std::shared_ptr<mxnet::Storage>::~shared_ptr()+0x52) [0x7f14f36aa822]
[bt] (7) /lib/x86_64-linux-gnu/libc.so.6(+0x39ff8) [0x7f1526617ff8]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(+0x3a045) [0x7f1526618045]
[bt] (9) /usr/bin/python() [0x51dc1f]


=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&) + 0x4e) [0x3745d8e]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run() + 0x4a) [0x372e01a]
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libstdc++.so.6 [0xb8c80]
=========     Host Frame:/lib/x86_64-linux-gnu/libpthread.so.0 [0x76ba]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (clone + 0x6d) [0x10741d]
=========
========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaFree. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 [0x3496d3]
=========     Host Frame:/usr/local/cuda/lib64/libcudart.so.9.0 (cudaFree + 0x1a0) [0x419b0]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::storage::GPUPooledStorageManager::DirectFreeNoLock(mxnet::Storage::Handle) + 0x32) [0x3751872]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::storage::GPUPooledStorageManager::ReleaseAll() + 0x95) [0x3749ef5]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (mxnet::storage::GPUPooledStorageManager::~GPUPooledStorageManager() + 0x1a) [0x37519ca]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::_Sp_counted_ptr<mxnet::StorageImpl*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0xa23) [0x3750bd3]
=========     Host Frame:/home/dcarter/mxnet_dev/dgx/mxnet/lib/libmxnet.so (std::shared_ptr<mxnet::Storage>::~shared_ptr() + 0x52) [0x3751822]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 [0x39ff8]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 [0x3a045]
=========     Host Frame:/usr/bin/python [0x11dc1f]
=========     Host Frame:/usr/bin/python [0x11b1b7]
=========     Host Frame:/usr/bin/python (PyErr_PrintEx + 0x2d) [0x11aadd]
=========     Host Frame:/usr/bin/python [0x309d5]
=========     Host Frame:/usr/bin/python (Py_Main + 0x612) [0x93ae2]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf0) [0x20830]
=========     Host Frame:/usr/bin/python (_start + 0x29) [0x933e9]
=========
========= Error: process didn't terminate successfully
========= No CUDA-MEMCHECK results found

aaronmarkham · 2018-06-28T22:27:03Z

I'm still blocked by this. Was it really disabled?

KellenSunderland · 2018-06-29T08:38:23Z

I don't think anyone disabled the test. Opening a PR to do so here: #11485

anirudh2290 · 2018-07-12T16:03:28Z

@DickJC123 Thank you for investigating this issue! I am not able to reproduce the test failure pointed by you with the commit : b786ead with the same test seed: 731510245 . Are you able to reproduce the issue with the latest master ?

marcoabreu · 2018-07-12T16:21:28Z

(I'll answer on behalf of dick because he told me that he will be quite busy around this time)
Hey anirudh, could you elaborate how you tried to reproduce this issue? Please be aware that this is about memory corruption because of out of bounds writes. This behaviour might now always be reproducible locally. Did you use cuda memcheck to investigate?

anirudh2290 · 2018-07-12T16:44:55Z

yes i am not able to similar errors with cuda memcheck either.

anirudh2290 · 2018-07-12T16:45:31Z

Tried the following:

MXNET_TEST_SEED=731510245 nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_sequence_last

Also did:

MXNET_TEST_SEED=731510245 cuda-memcheck nosetests --verbose -s tests/python/gpu/test_operator_gpu.py:test_sequence_last

couldn't reproduce Invalid __global__ read of size 4 with cudamemcheck

marcoabreu · 2018-07-12T16:54:49Z

How often did you run?

anirudh2290 · 2018-07-12T16:56:38Z

ran repeatedly around 2k times.

ChaiBapchya · 2020-01-30T05:29:27Z

Is this related? CUDA: unspecified launch failure- http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17444/2/pipeline/

ChaiBapchya · 2020-01-30T18:51:40Z

Another windows gpu cuda launch error : http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17444/4/pipeline #17444

aaronmarkham · 2020-01-31T18:31:58Z

Unspecified launch failure here:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17435/7/pipeline

What's the workaround? Need to a get PR through that has nothing to do with CUDA...

ChaiBapchya · 2020-01-31T19:06:09Z

I haven't found one. I kept retriggering and finally gave up.

kshitij12345 · 2020-02-02T05:43:39Z

Unspecified Launch Failure:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17435/7/pipeline/#step-169-log-1210

aaronmarkham · 2020-02-17T17:02:54Z

Again here.... and another I just restarted...
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17594/1/pipeline

aaronmarkham · 2020-02-18T00:44:50Z

Again: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17594/3/pipeline

marcoabreu added Windows Test Flaky labels Jun 25, 2018

marcoabreu mentioned this issue Jun 25, 2018

Flaky test on Python2 Windows #11394

Closed

marcoabreu added the Breaking label Jun 27, 2018

marcoabreu changed the title ~~Check failed: e == cudaSuccess CUDA: unspecified launch failure~~ Flaky test: test_operator_gpu.test_conv causes 'CUDA: unspecified launch failure' Jun 27, 2018

marcoabreu added the Disabled test label Jun 27, 2018

This was referenced Jun 28, 2018

Flaky test: test_operator.test_op_roi_align #11064

Closed

Add ability to query cuDNN BatchNorm min. epsilon. Allow ONNX importer to use cuDNN BN if chosen eps >= cuDNN min. eps. #11380

Merged

[MXNET-564] Fix the Flaky test on arange #11377

Merged

KellenSunderland mentioned this issue Jun 29, 2018

[MXNET-609] Disable flaky test test_sequence_last #11485

Merged

5 tasks

marcoabreu changed the title ~~Flaky test: test_operator_gpu.test_conv causes 'CUDA: unspecified launch failure'~~ Flaky test: test_operator_gpu.test_sequence_last causes 'CUDA: unspecified launch failure' Jun 29, 2018

marcoabreu mentioned this issue Jun 29, 2018

Re-enable test_conv #11490

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Flaky test: test_operator_gpu.test_sequence_last causes 'CUDA: unspecified launch failure' #11395

Flaky test: test_operator_gpu.test_sequence_last causes 'CUDA: unspecified launch failure' #11395

marcoabreu commented Jun 25, 2018

marcoabreu commented Jun 25, 2018

DickJC123 commented Jun 25, 2018

marcoabreu commented Jun 26, 2018

anirudhacharya commented Jun 26, 2018

marcoabreu commented Jun 27, 2018

DickJC123 commented Jun 28, 2018 •

edited

Loading

aaronmarkham commented Jun 28, 2018

KellenSunderland commented Jun 29, 2018

anirudh2290 commented Jul 12, 2018

marcoabreu commented Jul 12, 2018 •

edited

Loading

anirudh2290 commented Jul 12, 2018

anirudh2290 commented Jul 12, 2018 •

edited

Loading

marcoabreu commented Jul 12, 2018

anirudh2290 commented Jul 12, 2018

ChaiBapchya commented Jan 30, 2020

ChaiBapchya commented Jan 30, 2020

aaronmarkham commented Jan 31, 2020

ChaiBapchya commented Jan 31, 2020

kshitij12345 commented Feb 2, 2020

aaronmarkham commented Feb 17, 2020

aaronmarkham commented Feb 18, 2020

Flaky test: test_operator_gpu.test_sequence_last causes 'CUDA: unspecified launch failure' #11395

Flaky test: test_operator_gpu.test_sequence_last causes 'CUDA: unspecified launch failure' #11395

Comments

marcoabreu commented Jun 25, 2018

marcoabreu commented Jun 25, 2018

DickJC123 commented Jun 25, 2018

marcoabreu commented Jun 26, 2018

anirudhacharya commented Jun 26, 2018

marcoabreu commented Jun 27, 2018

DickJC123 commented Jun 28, 2018 • edited Loading

aaronmarkham commented Jun 28, 2018

KellenSunderland commented Jun 29, 2018

anirudh2290 commented Jul 12, 2018

marcoabreu commented Jul 12, 2018 • edited Loading

anirudh2290 commented Jul 12, 2018

anirudh2290 commented Jul 12, 2018 • edited Loading

marcoabreu commented Jul 12, 2018

anirudh2290 commented Jul 12, 2018

ChaiBapchya commented Jan 30, 2020

ChaiBapchya commented Jan 30, 2020

aaronmarkham commented Jan 31, 2020

ChaiBapchya commented Jan 31, 2020

kshitij12345 commented Feb 2, 2020

aaronmarkham commented Feb 17, 2020

aaronmarkham commented Feb 18, 2020

DickJC123 commented Jun 28, 2018 •

edited

Loading

marcoabreu commented Jul 12, 2018 •

edited

Loading

anirudh2290 commented Jul 12, 2018 •

edited

Loading