
Flaky test: test_ndarray.test_order #12310

Closed
Chancebair opened this issue Aug 23, 2018 · 10 comments

Comments

@Chancebair
Contributor

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1499/pipeline/

======================================================================
FAIL: test_ndarray.test_order
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/unittest/test_ndarray.py", line 765, in test_order
    assert_almost_equal(nd_ret_argsort, gt)
  File "/work/mxnet/python/mxnet/test_utils.py", line 491, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 471428.571429 exceeds tolerance rtol=0.000010, atol=0.000000.  Location of maximum error:(476,), a=120.000000, b=21.000000
 a: array([168,  58, 172, ..., 596, 514,  96], dtype=int32)
 b: array([168,  58, 172, ..., 596, 514,  96])

@ankkhedia
Contributor

Hi @Chancebair, thanks for filing the issue. We will look into the test.

@ankkhedia
Contributor

ankkhedia commented Aug 28, 2018

It seems there is a problem when using ndarrays of certain dtypes with the topk operator on GPU.
It can be reproduced by running a small code snippet against the latest mxnet build:

import mxnet as mx
import numpy as np

dat_size = 5
dtype = np.int64
# Build an int64 ndarray on GPU and run topk with ret_typ="mask"
a_npy = np.arange(dat_size ** 4, dtype=dtype).reshape((dat_size, dat_size, dat_size, dat_size))
a_nd = mx.nd.array(a_npy, ctx=mx.gpu(0), dtype=dtype)
nd_ret_topk = mx.nd.topk(a_nd, axis=1, ret_typ="mask", k=2, is_ascend=False).asnumpy()

The above code snippet led to this failure:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1976, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/ubuntu/incubator-mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [18:53:51] /home/ubuntu/incubator-mxnet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: misaligned address

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f92bac6005b]
[bt] (1) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f92bac60bc8]
[bt] (2) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mshadow::Stream<mshadow::gpu>::Wait()+0xd8) [0x7f92bd4a6ee8]
[bt] (3) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x3dc) [0x7f92bd64561c]
[bt] (4) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x3c2cacb) [0x7f92bdb96acb]
[bt] (5) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7f92bdb90f25]
[bt] (6) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f92bdba789b]
[bt] (7) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f92bdba7b0e]
[bt] (8) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f92bdb9052a]
[bt] (9) /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f9289c86c5c]

This support was added recently as part of this PR:
#12250

@sxjscience

@sxjscience
Member

Thanks for reporting this. I find that we can use other dtypes:

import mxnet as mx
import numpy as np

dat_size = 5
dtype = np.int32
# Same repro with int32: topk with ret_typ="mask" runs fine on GPU
a_npy = np.arange(dat_size ** 4, dtype=dtype).reshape((dat_size, dat_size, dat_size, dat_size))
a_nd = mx.nd.array(a_npy, ctx=mx.gpu(0), dtype=dtype)
nd_ret_topk = mx.nd.topk(a_nd, axis=1, k=2, ret_typ="mask", is_ascend=False)
print(nd_ret_topk.dtype)
print(nd_ret_topk)

I'm looking for the bug in the code.

@ankkhedia
Contributor

@sxjscience The issue seems to occur only with ret_typ="mask" for the topk operator.
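
A quick way to check this is to loop over the other ret_typ values. The snippet below is only a diagnostic sketch (it assumes a GPU build, and after a fatal CUDA error the remaining iterations may fail as well); only ret_typ="mask" is expected to trigger the misaligned-address error for int64 input:

import mxnet as mx
import numpy as np

dat_size = 5
a_npy = np.arange(dat_size ** 4, dtype=np.int64).reshape((dat_size,) * 4)
a_nd = mx.nd.array(a_npy, ctx=mx.gpu(0), dtype=np.int64)

for ret_typ in ["value", "indices", "mask"]:
    try:
        out = mx.nd.topk(a_nd, axis=1, k=2, ret_typ=ret_typ, is_ascend=False)
        out.wait_to_read()  # force the asynchronous GPU computation to run
        print(ret_typ, "-> OK, output dtype:", out.dtype)
    except mx.base.MXNetError as err:
        print(ret_typ, "-> failed:", str(err).splitlines()[0])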

@lebeg
Contributor

lebeg commented Oct 1, 2018

@piyushghai
Contributor

@ankkhedia Did your PR fix this issue for the failing test?
@lebeg Have we been able to see any more instances of this failure in recent times?
If it's fixed, can we close this issue? :)

@ankkhedia
Contributor

@piyushghai The flakiness of the test is fixed, but the PR fixing the topk operator is still not merged:
#12446
Some of the tests have been disabled as part of the above fix; they need to be re-enabled once the operator is fixed.

@sxjscience
Member

sxjscience commented Oct 10, 2018 via email

@sxjscience
Member

The bug is caused by memory misalignment. As stated in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses, CUDA device memory accesses must be aligned. Currently, we slice a char * buffer to generate pointers of other dtypes, and in this process we do not align these pointers correctly. This triggers the "CUDA: misaligned address" error.

This is a common mistake in many other implementations. The ultimate solution would be a helper function for allocating space for tensors with different dtypes and shapes, which could be added here: https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/resource.h#L152-L159

For now, I'll submit a PR to fix this problem for topk.
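
To illustrate the idea, here is a conceptual sketch in plain Python (not the actual C++ fix): when a single byte buffer is carved into sub-buffers for different dtypes, each offset has to be rounded up to that dtype's alignment, otherwise the resulting device pointer is misaligned.

def align_up(offset, alignment):
    # Round offset up to the next multiple of alignment.
    return (offset + alignment - 1) // alignment * alignment

def carve_workspace(requests):
    # requests: list of (size_in_bytes, alignment_in_bytes) per sub-buffer.
    # Returns aligned (offset, size) pairs plus the total bytes needed.
    offsets = []
    cursor = 0
    for size, alignment in requests:
        cursor = align_up(cursor, alignment)
        offsets.append((cursor, size))
        cursor += size
    return offsets, cursor

# Example: a 1-byte mask buffer followed by three int64 values. Naive slicing
# would place the int64 data at byte offset 1; align_up moves it to offset 8.
print(carve_workspace([(1, 1), (3 * 8, 8)]))  # ([(0, 1), (8, 24)], 32)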

@sxjscience sxjscience mentioned this issue Oct 11, 2018
@ankkhedia
Contributor

@sandeep-krishnamurthy The issue can be closed as the fixes have been merged.

@Chancebair Please feel free to reopen if closed in error.
