Hanging flaky test test_operator.test_norm @ Python 3: GPU Win #11509
Comments
On other occasions, it causes assertion errors:
|
I will look into it |
The latter error could be due to the fact that numpy.linalg.norm is not numerically stable (neither is the mxnet one). Possibly #11573 could fix this issue. |
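For reference, a minimal numpy-only sketch of the kind of instability being described (the values are illustrative, not the ones used in the test): computing sqrt-of-sum-of-squares can overflow in low precision even when the true norm is perfectly representable.

```python
import numpy as np

# Illustrative values only: each entry fits comfortably in float32, but the
# intermediate sum of squares (1e60) overflows, so the reported norm is inf.
x = np.array([1e30, 1e30], dtype=np.float32)
print(np.linalg.norm(x))            # inf

# A rescaled formulation avoids the overflow and recovers ~1.4142e30.
def stable_l2(v):
    m = np.abs(v).max()
    return m * np.sqrt(np.sum((v / m) ** 2)) if m > 0 else m

print(stable_l2(x))
```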
FYI @anirudhacharya the test has been partially re-enabled in #11573. |
@szha Would it make sense to close this issue and re-open it if we see test issues with test_norm again? |
@KellenSunderland as Sheng mentioned, it has only been partially enabled; there are certain test cases which are still not enabled, and the fix that was merged does not address the occasional hanging of the test case on the CI. |
@anirudh2290 @KellenSunderland Only the part that caused assertion errors during numerical checking of the gradient is disabled. It is not clear if the hang was occurring during that part. You may want to rename this issue to make it clear that currently a part of the test is disabled due to numerical instability of the gradient check. |
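To illustrate why the numerical gradient check in particular is fragile, here is a numpy-only sketch (not the actual MXNet gradient checker; the input values are made up): the central-difference estimate of the L2-norm gradient degrades badly once entries are of the same order as the perturbation step.

```python
import numpy as np

def l2(x):
    return np.sqrt(np.sum(x ** 2))

def numeric_grad(f, x, eps=1e-4):
    """Central finite differences, as a generic gradient checker would compute them."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d.flat[i] = eps
        g.flat[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

# Entries of the same order as eps: the finite-difference estimate is off by
# several percent, enough to trip a typical rtol-based assertion.
x = np.array([1e-4, -2e-4, 5e-5])
analytic = x / l2(x)                 # d||x||_2 / dx = x / ||x||_2
numeric = numeric_grad(l2, x)
print(np.abs(analytic - numeric).max())
```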
I'm looking into the issue now. |
I can't reproduce this on p3 and p2 instances so far. I keep running this test in a loop for now. Do the tests run on a different instance type? |
Hao and I weren't able to reproduce the error on the (supposedly) same instance type that CI slaves use, using the same docker building logic. |
G3. Are you running with our docker setup? |
yes, Sheng and I were using g3.8xlarge + instructions from https://cwiki.apache.org/confluence/display/MXNET/Reproducing+test+results |
@marcoabreu is it feasible to restart the CI Pipeline with the same global seed (not only test seed)? Jenkins should make it feasible to restart the pipeline when accessing http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11482/24/pipeline/861/ while being logged in, but I'm not sure about the setup for global seed. |
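For anyone trying to reproduce locally instead of via Jenkins: if I read the test harness correctly, it picks up seeds from environment variables (MXNET_MODULE_SEED for the module-level "global" seed, MXNET_TEST_SEED for the per-test seed). A sketch of pinning both while re-running just this test; the seed values are placeholders and the test-file path is an assumption:

```python
import os
import subprocess

# Placeholder seeds -- substitute the values printed in the failing CI log.
env = dict(os.environ,
           MXNET_MODULE_SEED="1234567890",   # module-level ("global") seed
           MXNET_TEST_SEED="987654321")      # per-test seed

# Re-run only the affected test under the pinned seeds (path assumed to be the
# unittest variant; the GPU job imports the same tests via test_operator_gpu.py).
subprocess.run(
    ["nosetests", "--verbose", "-s",
     "tests/python/unittest/test_operator.py:test_norm"],
    env=env, check=False)
```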
I don't think this is feasible, considering the setup is entirely reproducible with docker. Also the costs are not worth the effort. I might be stating the obvious, but did you use the latest master or the commit at that time to reproduce the error? |
@marcoabreu Hao and I have been using my branch and I imagine @leezu is doing the same. |
Update on this issue - I ran this test case ~5k times for different scenarios on a Linux GPU machine (EC2 p2.8xlarge instance). Currently the reasons for test failure are (1) the test occasionally hangs and (2) the test occasionally fails with accuracy/assertion errors.
The above two issues are only reproducible when run against the CPU context. When run against the GPU, the test neither hangs nor has accuracy issues. I need to check the same on a Windows instance. |
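A sketch of a loop for this kind of stress run (the nosetests invocation, path, and timeout below are assumptions, not the exact setup used above); the per-run timeout is what turns a hang into a countable failure instead of stalling the whole loop:

```python
import subprocess

# Stress-run the single test; a per-run timeout makes a hang show up as a
# recorded failure rather than blocking the loop indefinitely.
failures = []
for i in range(5000):
    try:
        proc = subprocess.run(
            ["nosetests", "-s",
             "tests/python/unittest/test_operator.py:test_norm"],
            timeout=600)                       # 10-minute budget per run
        if proc.returncode != 0:
            failures.append((i, "assertion/accuracy failure"))
    except subprocess.TimeoutExpired:
        failures.append((i, "hang (timed out)"))

print("failed runs:", failures)
```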
With regards to the hanging: it is not reproducible when I run it in a single-threaded environment; it only hangs in a multi-threaded environment, due to resource contention. I have been using strace to trace the run. The strace logs have recurring occurrences of the following system calls from competing threads - 101312 futex(0x22c1f90, FUTEX_WAKE_PRIVATE, 1) = 0
101169 futex(0x3b1f5bc, FUTEX_WAIT_PRIVATE, 6862611, NULL <unfinished ...>
101312 futex(0x7f19a4005ce4, FUTEX_WAKE_PRIVATE, 2147483647) = 0
101341 futex(0x7f19a4015194, FUTEX_WAKE_PRIVATE, 2147483647 <unfinished ...>
101312 futex(0x3b1f5bc, FUTEX_CMP_REQUEUE_PRIVATE, 1, 2147483647, 0x3b1f590, 6862612 <unfinished ...>
101341 <... futex resumed> ) = 0
101169 <... futex resumed> ) = 0
101312 <... futex resumed> ) = 1 |
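A complementary, Python-level way to see where the interpreter threads are stuck when it hangs (the strace output above only shows the native futex waits): the standard-library faulthandler module can periodically dump every Python thread's traceback. A sketch; the 300-second interval is arbitrary.

```python
import faulthandler
import sys

# If the run has not finished within 300 seconds, dump every Python thread's
# traceback to stderr and keep doing so until cancelled. This shows which
# Python frames are blocked while the native threads wait on the futexes
# visible in the strace log above.
faulthandler.dump_traceback_later(300, repeat=True, file=sys.stderr)

# ... run the suspect test here, e.g. the test_norm case ...

faulthandler.cancel_dump_traceback_later()
```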
test_operator.test_norm causes hangs at Python 3: GPU Win.
http://jenkins.mxnet-ci.amazon-ml.com/view/Master%20Jobs/job/incubator-mxnet/job/master/1109/