[MXNET-564] Fix the Flaky test on arange #11377

lanking520 · 2018-06-23T22:20:30Z

Description

@nswamy @yzhliu @andrewfayres @anirudh2290
Please see the fix for the issue: #10387

it may help this as well: #8383

After discussion with @andrewfayres , we found the problems was in the Scala section, ~~the last value of the arange is larger than the stop value~~ multiple numbers in the Array are different. We think this should be a bug of Scala as the function is depreciated in the log. As @reminisce recommended, we can increase the minimum value of the step since there is nobody use step in a E-05 level. In summary, there are several ways we can solve this problem:

~~Change the precision level of arange into E-03 and remove the last value of the array.~~ Doesn't help, problem persist
~~Change the way that Scala works into While loops and set precision level to E-06~~ Doesn't help, problem still there...
~~Change the measurement from reid_diff into almost_equal which mostly used in Python with rtol = true and precision level E-04.~~ Doesn't help, problem still there
Use BigDecimal and while loop to test, Passed!
Run multiple times with a fixed number in the start, stop and step since the target in here is to check arange is doing well.

We cannot reproduce the same issue with python after 1M tests done by @haojin2 . Python use numpy which takes Float as Double in their calculation with high accuracy. This PR represent as the ticked solution shown above. Currently, I keep 100000 runs in the code to make sure CI will pass as well. Will consider remove that when we merge into master.

Generally speaking, this issue is really rare to come out. I tried 100k and the issue is 70% reproducible. With the amended solution, I ran 10 times 100k and all of them passed. Since we already have a concrete Python test for this operator, consider remove this in the nearer future.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
Changes are complete (i.e. I finished coding on this PR)
All changes have test coverage:
Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
Code is well-documented:
For user-facing API changes, API doc string has been updated.
For new C++ functions in header files, their functionalities and arguments are documented.
For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

lanking520 · 2018-06-25T02:58:57Z

Source to the changes on the number:

We use toString on the number that passed to the backend, there should be minimum losses.
The number return type are NDArray, the toArray method brings this: DType.Float32 => units.map(wrapBytes(_).getFloat), this should have minimum impact to the result
Scala addition method already changed to while loop, but it doesn't help
Not only the last number in the Scala Array influence the accuracy, there are more.

Guys, do you think this problem is more like a JVM and C difference question?

marcoabreu · 2018-06-27T23:09:54Z

Please also note that test_arange is failing in Python: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1053/pipeline/

Could indicate a problem with the underlying implementation.

lanking520 · 2018-06-28T06:46:50Z

Hi @marcoabreu I think this issue is more related to CUDA, please see the message shown below:

Check failed: (err) == (cudaSuccess) Name: mxnet_generic_kernel ErrStr:unspecified launch failure

marcoabreu · 2018-06-28T08:03:02Z

Oops, my bad! Thanks for pointing it out. Then it's already tracked at #11395.

nswamy · 2018-06-28T08:25:16Z

scala-package/core/src/test/scala/org/apache/mxnet/CheckUtils.scala

@@ -30,4 +30,13 @@ object CheckUtils {
    val norm: Float = a.reduce(Math.abs(_) + Math.abs(_))
    diff / norm
  }
+
+  def almost_equal(a: Array[Float], b: Array[Float],


is this used anywhere, can we bring it when necessary

nswamy · 2018-06-28T08:25:20Z

scala-package/core/src/test/scala/org/apache/mxnet/OperatorSuite.scala

@@ -233,8 +233,13 @@ class OperatorSuite extends FunSuite with BeforeAndAfterAll
      val start = scala.util.Random.nextFloat() * 5


This is good find @lanking520 @andrewfayres, the precision is lost on the decimal part when it overflows with float. Declaring start, stop, step as doubles would just solve the problem. I would use just a double here instead of bigdecimal since its slow and unnecessary and this code pattern would soon start trickling to other tests. See the example that Andrew showed me with floats that doesn't work and max is very very close to upper bound with doubles.

scala> ( 3.969419d until 96.70541d by .00005197525d).flatMap(x=>Array.fill[Double](1)(x)).last res3: Double = 96.70537523623696

lanking520 · 2018-06-28T16:58:07Z

@nswamy Thanks Naveen for your feedback, I have made the changes accordingly. Tested on local with 100k runs.

* Fix Flaky Test - Change float to double to increase the precision of numbers generated in scala

lanking520 requested a review from yzhliu as a code owner June 23, 2018 22:20

lanking520 requested review from marcoabreu and szha as code owners June 26, 2018 06:36

lanking520 force-pushed the arange-bugfix branch from 84627a4 to e2f0330 Compare June 26, 2018 06:39

marcoabreu approved these changes Jun 28, 2018

View reviewed changes

nswamy suggested changes Jun 28, 2018

View reviewed changes

lanking520 added 7 commits June 28, 2018 09:56

add bug fix code

7751076

tried cases

9110f21

tried rtol

e9d3345

Big Decimal solution

9a3f8f3

change repeat = 1

d9e76f4

put back to normal 5 tests

4188598

change the float to double to increasse the precision

bf4ef50

lanking520 force-pushed the arange-bugfix branch from e2f0330 to bf4ef50 Compare June 28, 2018 16:56

szha removed their request for review June 28, 2018 23:33

nswamy merged commit c84a2f6 into apache:master Jun 29, 2018

lanking520 deleted the arange-bugfix branch June 29, 2018 21:28

lanking520 mentioned this pull request Jul 16, 2018

[MXNET-564] Fix the Flaky test on arange for NDArray #11766

Merged

5 tasks

XinYao1994 pushed a commit to XinYao1994/incubator-mxnet that referenced this pull request Aug 29, 2018

[MXNET-564] Fix the Flaky test on arange (apache#11377)

ed68aa2

* Fix Flaky Test - Change float to double to increase the precision of numbers generated in scala

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[MXNET-564] Fix the Flaky test on arange #11377

[MXNET-564] Fix the Flaky test on arange #11377

lanking520 commented Jun 23, 2018 •

edited

Loading

lanking520 commented Jun 25, 2018

marcoabreu commented Jun 27, 2018

lanking520 commented Jun 28, 2018

marcoabreu commented Jun 28, 2018

nswamy Jun 28, 2018

nswamy Jun 28, 2018

lanking520 commented Jun 28, 2018

		@@ -233,8 +233,13 @@ class OperatorSuite extends FunSuite with BeforeAndAfterAll
		val start = scala.util.Random.nextFloat() * 5

[MXNET-564] Fix the Flaky test on arange #11377

[MXNET-564] Fix the Flaky test on arange #11377

Conversation

lanking520 commented Jun 23, 2018 • edited Loading

Description

Checklist

Essentials

lanking520 commented Jun 25, 2018

marcoabreu commented Jun 27, 2018

lanking520 commented Jun 28, 2018

marcoabreu commented Jun 28, 2018

nswamy Jun 28, 2018

Choose a reason for hiding this comment

nswamy Jun 28, 2018

Choose a reason for hiding this comment

lanking520 commented Jun 28, 2018

lanking520 commented Jun 23, 2018 •

edited

Loading