
Flaky test: test_ndarray.test_order #12310

Closed
Chancebair opened this issue Aug 23, 2018 · 10 comments

Comments

@Chancebair
Contributor

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1499/pipeline/

======================================================================
FAIL: test_ndarray.test_order
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/nose/case.py", line 198, in runTest
    self.test(*self.arg)
  File "/work/mxnet/tests/python/unittest/common.py", line 172, in test_new
    orig_test(*args, **kwargs)
  File "/work/mxnet/tests/python/unittest/test_ndarray.py", line 765, in test_order
    assert_almost_equal(nd_ret_argsort, gt)
  File "/work/mxnet/python/mxnet/test_utils.py", line 491, in assert_almost_equal
    raise AssertionError(msg)
AssertionError: 
Items are not equal:
Error 471428.571429 exceeds tolerance rtol=0.000010, atol=0.000000.  Location of maximum error:(476,), a=120.000000, b=21.000000
 a: array([168,  58, 172, ..., 596, 514,  96], dtype=int32)
 b: array([168,  58, 172, ..., 596, 514,  96])

@ankkhedia
Contributor

Hi @Chancebair, thanks for filing the issue. We will look into the test.

@ankkhedia
Contributor

ankkhedia commented Aug 28, 2018

It seems there is a problem when using ndarrays of certain dtypes with the topk operator on GPU.
It can be reproduced by running a small code snippet against the latest mxnet build:

import mxnet as mx
import numpy as np

dat_size = 5
dtype = np.int64
# Build an int64 ndarray on GPU and run topk with ret_typ="mask"
a_npy = np.arange(dat_size ** 4, dtype=dtype).reshape((dat_size, dat_size, dat_size, dat_size))
a_nd = mx.nd.array(a_npy, ctx=mx.gpu(0), dtype=dtype)
nd_ret_topk = mx.nd.topk(a_nd, axis=1, ret_typ="mask", k=2, is_ascend=False).asnumpy()

The above code snippet led to this failure:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1976, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/ubuntu/incubator-mxnet/python/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [18:53:51] /home/ubuntu/incubator-mxnet/3rdparty/mshadow/mshadow/./stream_gpu-inl.h:62: Check failed: e == cudaSuccess CUDA: misaligned address

Stack trace returned 10 entries:
[bt] (0) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f92bac6005b]
[bt] (1) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f92bac60bc8]
[bt] (2) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mshadow::Stream<mshadow::gpu>::Wait()+0xd8) [0x7f92bd4a6ee8]
[bt] (3) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x3dc) [0x7f92bd64561c]
[bt] (4) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x3c2cacb) [0x7f92bdb96acb]
[bt] (5) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0x8e5) [0x7f92bdb90f25]
[bt] (6) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f92bdba789b]
[bt] (7) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f92bdba7b0e]
[bt] (8) /home/ubuntu/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f92bdb9052a]
[bt] (9) /home/ubuntu/anaconda3/bin/../lib/libstdc++.so.6(+0xafc5c) [0x7f9289c86c5c]

This support was added recently as part of this PR:
#12250

@sxjscience

@sxjscience
Member

Thanks for reporting this. I find that we can use other dtypes:

import mxnet as mx
import numpy as np

dat_size = 5
dtype = np.int32
# Same repro with int32: topk with ret_typ="mask" runs fine on GPU
a_npy = np.arange(dat_size ** 4, dtype=dtype).reshape((dat_size, dat_size, dat_size, dat_size))
a_nd = mx.nd.array(a_npy, ctx=mx.gpu(0), dtype=dtype)
nd_ret_topk = mx.nd.topk(a_nd, axis=1, k=2, ret_typ="mask", is_ascend=False)
print(nd_ret_topk.dtype)
print(nd_ret_topk)

I'm looking for the bug in the code.

@ankkhedia
Contributor

@sxjscience The issue seems to occur only with ret_typ="mask" for the topk operator.
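
A quick way to check this is to loop over the other ret_typ values. The snippet below is only a diagnostic sketch (it assumes a GPU build, and after a fatal CUDA error the remaining iterations may fail as well); only ret_typ="mask" is expected to trigger the misaligned-address error for int64 input:

import mxnet as mx
import numpy as np

dat_size = 5
a_npy = np.arange(dat_size ** 4, dtype=np.int64).reshape((dat_size,) * 4)
a_nd = mx.nd.array(a_npy, ctx=mx.gpu(0), dtype=np.int64)

for ret_typ in ["value", "indices", "mask"]:
    try:
        out = mx.nd.topk(a_nd, axis=1, k=2, ret_typ=ret_typ, is_ascend=False)
        out.wait_to_read()  # force the asynchronous GPU computation to run
        print(ret_typ, "-> OK, output dtype:", out.dtype)
    except mx.base.MXNetError as err:
        print(ret_typ, "-> failed:", str(err).splitlines()[0])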

@lebeg
Contributor

lebeg commented Oct 1, 2018

@piyushghai
Contributor

@ankkhedia Did your PR fix this issue for the failing test?
@lebeg Have we been able to see any more instances of this failure in recent times?
If it's fixed, can we close this issue? :)

@ankkhedia
Contributor

@piyushghai The flakiness of the test is fixed, but the PR fixing the topk operator is still not merged:
#12446
Some of the tests have been disabled as part of the above fix; they need to be re-enabled once the operator is fixed.

@sxjscience
Member

sxjscience commented Oct 10, 2018 via email

@sxjscience
Member

The bug is caused by memory misalignment. As stated in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#device-memory-accesses, CUDA device memory accesses must be aligned. Currently, we slice a char * buffer to generate pointers of other dtypes, and in this process we do not align these pointers correctly. This triggers the "CUDA: misaligned address" error.

This is a common mistake in many other implementations. The ultimate solution would be a helper function for allocating space for tensors with different dtypes and shapes, which could be added here: https://github.com/apache/incubator-mxnet/blob/master/include/mxnet/resource.h#L152-L159

For now, I'll submit a PR to fix this problem for topk.
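
To illustrate the idea, here is a conceptual sketch in plain Python (not the actual C++ fix): when a single byte buffer is carved into sub-buffers for different dtypes, each offset has to be rounded up to that dtype's alignment, otherwise the resulting device pointer is misaligned.

def align_up(offset, alignment):
    # Round offset up to the next multiple of alignment.
    return (offset + alignment - 1) // alignment * alignment

def carve_workspace(requests):
    # requests: list of (size_in_bytes, alignment_in_bytes) per sub-buffer.
    # Returns aligned (offset, size) pairs plus the total bytes needed.
    offsets = []
    cursor = 0
    for size, alignment in requests:
        cursor = align_up(cursor, alignment)
        offsets.append((cursor, size))
        cursor += size
    return offsets, cursor

# Example: a 1-byte mask buffer followed by three int64 values. Naive slicing
# would place the int64 data at byte offset 1; align_up moves it to offset 8.
print(carve_workspace([(1, 1), (3 * 8, 8)]))  # ([(0, 1), (8, 24)], 32)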

@sxjscience sxjscience mentioned this issue Oct 11, 2018
@ankkhedia
Contributor

@sandeep-krishnamurthy The issue can be closed as the fixes have been merged.

@Chancebair Please feel free to reopen if closed in error.
