This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[WIP][Bugfix] Fix flaky topk #12446

Closed
wants to merge 4 commits

Conversation

@sxjscience (Member) commented Sep 3, 2018

Description

This PR fixes the flaky topk test reported in #12358 and #12310. The previous bug was caused by not manually setting the dtype of mx.ndarrays when constructing them from numpy ndarrays (related issue: #12268). After reimplementing the IndexFill function, the test now passes.
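For context, the dtype pitfall behind the linked issue can be illustrated without MXNet at all: when integer index data ends up in float32 (the default dtype MXNet uses when one is not set explicitly), large indices are silently rounded, because float32 has only a 24-bit significand. A minimal numpy sketch of the precision loss, not MXNet's code:

```python
import numpy as np

# float32 has a 24-bit significand, so not every integer above 2**24 is
# representable; casting a large index to float32 silently rounds it.
idx = 2**24 + 1            # 16777217
as_f32 = np.float32(idx)   # rounds to the nearest representable float32
print(int(as_f32))         # 16777216 -- the index has changed
```

This is why constructing an mx.ndarray from a numpy integer array without pinning the dtype can corrupt index values once the element count is large enough.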

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on the test set, and a reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Fix topk and its test code.

Comments

fix

try to fix test

try to fix flaky test

try to fix test

fix test

fix

fix
@sxjscience (Member Author) commented:

@ankkhedia

@marcoabreu (Contributor) commented:

Any reason you are removing int16 support?

@sxjscience (Member Author) commented Sep 3, 2018:

@marcoabreu That's because the ndarray in MXNet does not support int16, so I removed it.

@sxjscience sxjscience changed the title [Bugfix] Fix flaky topk [WIP][Bugfix] Fix flaky topk Sep 3, 2018
@sxjscience (Member Author) commented Sep 3, 2018 via email

@ankkhedia (Contributor) commented:

@sxjscience ping!

Did you get a chance to look into the failure? This might be required for fixing the flaky test I referenced.

@kalyc (Contributor) commented Sep 13, 2018:

@sxjscience could you please update the issue? Resolving the flaky test in #12358 is blocked by this.

@sxjscience (Member Author) commented Sep 13, 2018 via email

@kalyc (Contributor) commented Sep 14, 2018:

@mxnet-label-bot[pr-awaiting-response]

@marcoabreu marcoabreu added the pr-awaiting-response PR is reviewed and waiting for contributor to respond label Sep 14, 2018
@@ -455,8 +457,7 @@ void TopKImpl(const RunContext &ctx,
   // Cast `ret_indices` from int to real_t could introduce conversion error when the element_num
   // is large enough.
   if (param.ret_typ == topk_enum::kReturnMask) {
-    Tensor<xpu, 2, DType> ret_mask =
-      ret[0].get_with_shape<xpu, 2, DType>(Shape2(ret[0].Size(), 1), s);
+    Tensor<xpu, 1, DType> ret_mask = ret[0].FlatTo1D<xpu, DType>(s);
     ret_mask = scalar<DType>(0);
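For readers unfamiliar with the kReturnMask path touched in this hunk: with ret_typ set to mask, topk returns a tensor of the input's shape holding ones at the top-k positions and zeros elsewhere. A rough numpy sketch of that semantics (an illustration only, not MXNet's implementation, which zeroes a flattened view and then index-fills it as the diff shows):

```python
import numpy as np

def topk_mask(a, k):
    # Sketch of topk(ret_typ='mask') semantics: a zero mask with ones at
    # the positions of the k largest elements along the last axis.
    mask = np.zeros_like(a)
    top_idx = np.argsort(a, axis=-1)[..., ::-1][..., :k]
    np.put_along_axis(mask, top_idx, 1, axis=-1)
    return mask

print(topk_mask(np.array([[1., 3., 2.]]), 2))  # [[0. 1. 1.]]
```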
@sxjscience (Member Author) commented on the diff:

Now it raises a really weird "CUDA Misaligned Memory Error", and I currently have no idea what triggers it. It actually happens when we initialize ret_mask to all zeros.

A contributor commented on the diff:

@anirudh2290 @azai91 @apeforest @samskalicky @eric-haibin-lin - maybe one of you guys can help?

A contributor commented on the diff:

@sxjscience were you able to resolve the error?

@sxjscience (Member Author) commented Sep 25, 2018 via email

@lebeg (Contributor) commented Oct 1, 2018:

@sxjscience once you've fixed this, could you submit a PR to 1.3.x as well? Currently the branch build is broken due to the test_operator_gpu.test_order test: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/v1.3.x/40/pipeline

@sxjscience mentioned this pull request Oct 11, 2018
@sxjscience closed this Oct 11, 2018
Labels
pr-awaiting-response PR is reviewed and waiting for contributor to respond
Development

Successfully merging this pull request may close these issues.

Inconsistent type conversion from numpy.ndarray to mx.ndarray
7 participants