Flaky test: test_ndarray.test_order #12310
Comments
Hi @Chancebair, Thanks for filing the issue. We will look into the test. |
It seems like there is a problem when using different dtype ndarrays on GPU with the topk operator.

```python
import mxnet as mx
import numpy as np

dat_size = 5
dtype = np.int64
a_npy = np.arange(dat_size ** 4, dtype=dtype).reshape((dat_size, dat_size, dat_size, dat_size))
a_nd = mx.nd.array(a_npy, ctx=mx.gpu(0), dtype=dtype)
nd_ret_topk = mx.nd.topk(a_nd, axis=1, ret_typ="mask", k=2, is_ascend=False).asnumpy()
```

The above code snippet led to this failure:

```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/ubuntu/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1976, in asnumpy
ctypes.c_size_t(data.size)))
File "/home/ubuntu/incubator-mxnet/python/mxnet/base.py", line 252, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [18:53:51] /home/ubuntu/incubator-mxnet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: misaligned address
Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f92bac6005b]
[bt] (1) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f92bac60bc8]
[bt] (2) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mshadow::Stream<mshadow::gpu>::Wait()+0xd8) [0x7f92bd4a6ee8]
[bt] (3) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x3dc) [0x7f92bd64561c]
[bt] (4) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x3c2cacb) [0x7f92bdb96acb]
[bt] (5) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7f92bdb90f25]
[bt] (6) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f92bdba789b]
[bt] (7) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f92bdba7b0e]
[bt] (8) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f92bdb9052a]
[bt] (9) /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f9289c86c5c]
```

The support was added recently as a part of this PR: |
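For comparison, here is a sketch (assuming the failure is specific to the GPU path, as the comment above suggests) that runs the same call on the CPU context as a sanity check that the int64 input itself is fine:

```python
import mxnet as mx
import numpy as np

dat_size = 5
dtype = np.int64
a_npy = np.arange(dat_size ** 4, dtype=dtype).reshape((dat_size,) * 4)

# Same call as in the repro above, but on the CPU context instead of mx.gpu(0).
a_nd = mx.nd.array(a_npy, ctx=mx.cpu(), dtype=dtype)
nd_ret_topk = mx.nd.topk(a_nd, axis=1, ret_typ="mask", k=2, is_ascend=False).asnumpy()
print(nd_ret_topk.shape, nd_ret_topk.dtype)
```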
Thanks for reporting this. I find that other dtypes, such as np.int32, work:

```python
import mxnet as mx
import numpy as np

dat_size = 5
dtype = np.int32
a_npy = np.arange(dat_size ** 4, dtype=dtype).reshape((dat_size, dat_size, dat_size, dat_size))
a_nd = mx.nd.array(a_npy, ctx=mx.gpu(0), dtype=dtype)
nd_ret_topk = mx.nd.topk(a_nd, axis=1, k=2, ret_typ="mask", is_ascend=False)
print(nd_ret_topk.dtype)
print(nd_ret_topk)
```

I'm looking for the bug in the code. |
@sxjscience The issue seems to occur only with ret_typ="mask" for the topk operator |
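A minimal sketch (not from the original thread) that loops over the ret_typ values to reproduce this observation; it assumes a GPU context is available and runs "mask" last, since a misaligned-address error can leave the CUDA context unusable for later calls:

```python
import mxnet as mx
import numpy as np

dat_size = 5
a_npy = np.arange(dat_size ** 4, dtype=np.int64).reshape((dat_size,) * 4)
a_nd = mx.nd.array(a_npy, ctx=mx.gpu(0), dtype=np.int64)

# "mask" goes last because it is the variant reported to fail.
for ret_typ in ["value", "indices", "both", "mask"]:
    try:
        out = mx.nd.topk(a_nd, axis=1, k=2, ret_typ=ret_typ, is_ascend=False)
        outputs = out if isinstance(out, list) else [out]
        for o in outputs:
            o.asnumpy()  # force the asynchronous computation to complete
        print(ret_typ, "ok")
    except mx.base.MXNetError as err:
        print(ret_typ, "failed:", str(err).splitlines()[0])
```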
This test is failing in the 1.3.x branch as well: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.3.x/40/pipeline |
@ankkhedia Did your PR fix this issue for the failing test? |
@piyushghai The flakiness of the test is fixed but the PR for fixing the topk operator is still not merged.
#12446
Some of the tests have been disabled as part of the above fix and need to be re-enabled once the operator gets fixed. |
I’m looking into the problem again. There is a memory access error in the CUB part reported by CUDA-MEMCHECK; I still do not know why. |
The bug is caused by memory misalignment. As stated in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses, CUDA device memory accesses must be aligned. Currently, we slice a single temporary workspace into tensors with different dtypes, so the sliced addresses are not guaranteed to meet the alignment requirement. This is a common mistake in many other implementations. The ultimate solution would be a helper function for allocating space for tensors with different dtypes and shapes, which could be added here: https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/resource.h#L152-L159. For now, I'll submit a PR to fix this problem for topk. |
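To illustrate the alignment rule, here is a small, hypothetical Python sketch (not the actual MXNet C++ code): when one byte workspace is carved into tensors of different dtypes, each slice's byte offset has to be rounded up to that dtype's alignment, otherwise an int64 pointer can land on an address that is not a multiple of 8:

```python
import numpy as np

def aligned_offset(offset, alignment):
    """Round a byte offset up to the next multiple of `alignment`."""
    return (offset + alignment - 1) // alignment * alignment

# Hypothetical workspace layout: a small int8 mask tensor followed by int64 data.
mask_bytes = 5 * np.dtype(np.int8).itemsize      # 5 bytes for the mask
int64_align = np.dtype(np.int64).alignment       # typically 8 bytes

naive_offset = mask_bytes                          # 5 -> misaligned for int64 loads
safe_offset = aligned_offset(mask_bytes, int64_align)  # 8 -> properly aligned

print("naive offset:", naive_offset, "aligned offset:", safe_offset)
```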
@sandeep-krishnamurthy The issue can be closed as the fixes have been merged. @Chancebair Please feel free to reopen if closed in error. |
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1499/pipeline/