This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[WIP][Bugfix] Fix flaky topk #12446

Closed
wants to merge 4 commits

Conversation

@sxjscience (Member) commented Sep 3, 2018

Description

This PR fixes the flaky topk test reported in #12358 and #12310. The previous bug was caused by not manually setting the dtype of mx.ndarrays when constructing them from numpy ndarrays (related issue: #12268). After reimplementing the IndexFill function, the test now passes.
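For context, the dtype pitfall behind the linked issue can be illustrated without MXNet at all: when integer index data ends up in float32 (the default dtype MXNet uses when one is not set explicitly), large indices are silently rounded, because float32 has only a 24-bit significand. A minimal numpy sketch of the precision loss, not MXNet's code:

```python
import numpy as np

# float32 has a 24-bit significand, so not every integer above 2**24 is
# representable; casting a large index to float32 silently rounds it.
idx = 2**24 + 1            # 16777217
as_f32 = np.float32(idx)   # rounds to the nearest representable float32
print(int(as_f32))         # 16777216 -- the index has changed
```

This is why constructing an mx.ndarray from a numpy integer array without pinning the dtype can corrupt index values once the element count is large enough.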

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Fix topk and its test code.

Comments

fix

try to fix test

try to fix flaky test

try to fix test

fix test

fix

fix
@sxjscience (Member Author) commented:

@ankkhedia

@marcoabreu (Contributor) commented:

Any reason you are removing int16 support?

@sxjscience (Member Author) commented Sep 3, 2018:

@marcoabreu That's because the ndarray in MXNet does not support int16, so I removed it.

@sxjscience sxjscience changed the title [Bugfix] Fix flaky topk [WIP][Bugfix] Fix flaky topk Sep 3, 2018
@sxjscience (Member Author) commented Sep 3, 2018 via email

@ankkhedia (Contributor) commented:

@sxjscience ping!

Did you get a chance to look into the failure? This might be required for fixing the flaky test I referenced.

@kalyc (Contributor) commented Sep 13, 2018:

@sxjscience could you please update the issue? Resolving the flaky test in #12358 is blocked by this.

@sxjscience (Member Author) commented Sep 13, 2018 via email

@kalyc (Contributor) commented Sep 14, 2018:

@mxnet-label-bot[pr-awaiting-response]

@marcoabreu marcoabreu added the pr-awaiting-response PR is reviewed and waiting for contributor to respond label Sep 14, 2018
@@ -455,8 +457,7 @@ void TopKImpl(const RunContext &ctx,
   // Cast `ret_indices` from int to real_t could introduce conversion error when the element_num
   // is large enough.
   if (param.ret_typ == topk_enum::kReturnMask) {
-    Tensor<xpu, 2, DType> ret_mask =
-      ret[0].get_with_shape<xpu, 2, DType>(Shape2(ret[0].Size(), 1), s);
+    Tensor<xpu, 1, DType> ret_mask = ret[0].FlatTo1D<xpu, DType>(s);
     ret_mask = scalar<DType>(0);
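For readers unfamiliar with the kReturnMask path touched in this hunk: with ret_typ set to mask, topk returns a tensor of the input's shape holding ones at the top-k positions and zeros elsewhere. A rough numpy sketch of that semantics (an illustration only, not MXNet's implementation, which zeroes a flattened view and then index-fills it as the diff shows):

```python
import numpy as np

def topk_mask(a, k):
    # Sketch of topk(ret_typ='mask') semantics: a zero mask with ones at
    # the positions of the k largest elements along the last axis.
    mask = np.zeros_like(a)
    top_idx = np.argsort(a, axis=-1)[..., ::-1][..., :k]
    np.put_along_axis(mask, top_idx, 1, axis=-1)
    return mask

print(topk_mask(np.array([[1., 3., 2.]]), 2))  # [[0. 1. 1.]]
```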
@sxjscience (Member Author) commented on the diff:

Now it raises a really weird "CUDA Misaligned Memory Error", and I currently have no idea what triggers it. It actually happens when we initialize ret_mask to all zeros.

A contributor commented on the diff:

@anirudh2290 @azai91 @apeforest @samskalicky @eric-haibin-lin - maybe one of you guys can help?

A contributor commented on the diff:

@sxjscience were you able to resolve the error?

@sxjscience (Member Author) commented Sep 25, 2018 via email

@lebeg (Contributor) commented Oct 1, 2018:

@sxjscience once you've fixed this, could you submit a PR to 1.3.x as well? Currently the branch build is broken due to the test_operator_gpu.test_order test: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.3.x/40/pipeline

@sxjscience mentioned this pull request Oct 11, 2018
@sxjscience closed this Oct 11, 2018
Labels
pr-awaiting-response PR is reviewed and waiting for contributor to respond
Development

Successfully merging this pull request may close these issues.

Inconsistent type conversion from numpy.ndarray to mx.ndarray
7 participants