Flaky test: test_operator_gpu.test_sequence_last causes 'CUDA: unspecified launch failure' #11395
Comments
In a private communication, you indicated this was seen on all platforms. Here you tag it as 'Windows'. Please clarify. |
Sorry Dick, I just double checked my database and it seems to only happen on Windows. It seems like I mixed something up, please excuse me for that. Config: Windows Server 2016, G3.8xlarge, CUDA8, unknown driver version |
Another Jenkins log for flaky failure of test_operator_gpu.test_op_roi_align - http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/incubator-mxnet/branches/PR-10889/runs/17/nodes/751/log/?start=0 |
This seems to be caused by test_conv:
|
I have a lead on the problem. There is an out-of-bounds read performed by the SequenceLastKernel. I'll stop here and let the person responsible for this kernel correct the problem. Kernels that read beyond their valid input tensor regions can be problematic, even if the random data read is never used in a subsequent kernel write. The problem surfaces when the reads fall outside valid mapped address ranges, which results in an unserviceable TLB miss. The failures can be non-deterministic because the input tensors may have non-deterministic placement within their mapped pages. I debugged the problem by going to the first test that showed the failure in one of the above posts, capturing the MXNET_TEST_SEED, and then reproducing the error (on Linux, no less) with the following command:
|
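To illustrate the class of bug described in the comment above, here is a minimal pure-Python sketch of how a "take the last time step" operation can compute a read offset outside its input buffer. It is not the MXNet SequenceLastKernel. The row-major `(max_len, batch, rest)` layout, the function names, and the sample lengths are all assumptions made for illustration.

```python
def last_step_offsets(seq_lengths, batch, rest):
    """Flat start offset read for each batch item.

    Assumes (hypothetically) a row-major (max_len, batch, rest) input,
    where item b reads `rest` elements from time step seq_lengths[b] - 1.
    """
    return [(int(n) - 1) * batch * rest + b * rest
            for b, n in enumerate(seq_lengths)]


def out_of_bounds_reads(seq_lengths, max_len, batch, rest):
    """Return (batch_index, offset) pairs whose read leaves the buffer."""
    total = max_len * batch * rest  # number of elements in the input
    bad = []
    for b, off in enumerate(last_step_offsets(seq_lengths, batch, rest)):
        # A read is valid only if [off, off + rest) lies inside [0, total).
        if off < 0 or off + rest > total:
            bad.append((b, off))
    return bad
```

A length of 0 yields a negative offset and a length greater than `max_len` yields an offset past the end of the buffer. On a GPU, either read may land in an unmapped page depending on where the tensor was allocated, which is consistent with the intermittent behavior described above.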
I'm still blocked by this. Was it really disabled? |
I don't think anyone disabled the test. Opening a PR to do so here: #11485 |
@DickJC123 Thank you for investigating this issue! I am not able to reproduce the test failure you pointed out at commit b786ead with the same test seed, 731510245. Are you able to reproduce the issue with the latest master? |
(I'll answer on behalf of Dick because he told me that he will be quite busy around this time) |
Yes, I am not able to reproduce similar errors with cuda-memcheck either. |
Tried the following:
Also did:
Couldn't reproduce. |
How often did you run? |
I ran it repeatedly, around 2k times. |
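For anyone repeating this kind of experiment, a loop like the following sketch can rerun a test command with a fixed MXNET_TEST_SEED and stop at the first failure. The helper name `run_until_failure` is hypothetical, and the command used in the demo is a stand-in; substitute the real test invocation for your checkout.

```python
import os
import subprocess
import sys


def run_until_failure(cmd, seed, max_runs):
    """Run `cmd` up to `max_runs` times with MXNET_TEST_SEED fixed.

    Returns (attempt_number, CompletedProcess) for the first failing run,
    or (None, None) if every run exits with status 0.
    """
    env = dict(os.environ, MXNET_TEST_SEED=str(seed))
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd, env=env, capture_output=True, text=True)
        if result.returncode != 0:
            return attempt, result
    return None, None


if __name__ == "__main__":
    # Stand-in command that always succeeds; replace with the real test run.
    demo_cmd = [sys.executable, "-c", "pass"]
    attempt, result = run_until_failure(demo_cmd, seed=731510245, max_runs=3)
    print("no failure seen" if attempt is None else f"failed on run {attempt}")
```

Because the failure depends on where tensors happen to be placed in memory, a fixed seed alone may not trigger it on every run, so a high `max_runs` is reasonable.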
Is this related? CUDA: unspecified launch failure: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17444/2/pipeline/ |
Another Windows GPU CUDA launch error: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Fwindows-gpu/detail/PR-17444/4/pipeline #17444 |
Unspecified launch failure here: What's the workaround? I need to get a PR through that has nothing to do with CUDA... |
I haven't found one. I kept retriggering and finally gave up. |
Again here.... and another I just restarted... |
Sometimes our slaves get corrupted and suddenly all tests start to fail. This is not directly related to the tests themselves.
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11377/5/pipeline/