This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Windows segmentation faults in GPU tests #17635

Open
larroy opened this issue Feb 20, 2020 · 4 comments

@larroy
Contributor

larroy commented Feb 20, 2020

Description

Windows GPU tests run in the updated environment from https://github.com/aiengines/ci fail with the following:

======================================================================
ERROR: Failure: OSError (exception: access violation writing 0x0000000000000000)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "C:\Python37\lib\site-packages\nose\failure.py", line 39, in runTest
    raise self.exc_val.with_traceback(self.tb)
  File "C:\Python37\lib\site-packages\nose\loader.py", line 418, in loadTestsFromName
    addr.filename, addr.module)
  File "C:\Python37\lib\site-packages\nose\importer.py", line 47, in importFromPath
    return self.importFromDir(dir_path, fqname)
  File "C:\Python37\lib\site-packages\nose\importer.py", line 94, in importFromDir
    mod = load_module(part_fqname, fh, filename, desc)
  File "C:\Python37\lib\imp.py", line 235, in load_module
    return load_source(name, filename, file)
  File "C:\Python37\lib\imp.py", line 172, in load_source
    module = _load(spec)
  File "<frozen importlib._bootstrap>", line 696, in _load
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "C:\Users\Administrator\mxnet\tests\python\unittest\test_test_utils.py", line 21, in <module>
    import mxnet as mx
  File "C:\Users\Administrator\mxnet\windows_package\python\mxnet\__init__.py", line 33, in <module>
    from . import contrib
  File "C:\Users\Administrator\mxnet\windows_package\python\mxnet\contrib\__init__.py", line 27, in <module>
    from . import autograd
  File "C:\Users\Administrator\mxnet\windows_package\python\mxnet\contrib\autograd.py", line 27, in <module>
    from ..ndarray import NDArray, zeros_like, _GRAD_REQ_MAP
  File "C:\Users\Administrator\mxnet\windows_package\python\mxnet\ndarray\__init__.py", line 20, in <module>
    from . import _internal, contrib, linalg, op, random, sparse, utils, image, ndarray, numpy
  File "C:\Users\Administrator\mxnet\windows_package\python\mxnet\ndarray\numpy\__init__.py", line 23, in <module>
    from . import _register
  File "C:\Users\Administrator\mxnet\windows_package\python\mxnet\ndarray\numpy\_register.py", line 21, in <module>
    from ..register import _make_ndarray_function
  File "C:\Users\Administrator\mxnet\windows_package\python\mxnet\ndarray\register.py", line 277, in <module>
    _init_op_module('mxnet', 'ndarray', _make_ndarray_function)
  File "C:\Users\Administrator\mxnet\windows_package\python\mxnet\base.py", line 682, in _init_op_module
    ctypes.byref(plist)))
OSError: exception: access violation writing 0x0000000000000000
 
======================================================================
ERROR: Failure: OSError (exception: access violation writing 0x0000000000000000)
----------------------------------------------------------------------
Traceback (most recent call last):

Error Message

(Paste the complete error message. Please also include stack trace by setting environment variable DMLC_LOG_STACK_TRACE_DEPTH=10 before running your script.)
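
The crash above happens while `import mxnet` is loading the native library, so one way to capture more detail is to run the import in a separate interpreter. The following is a minimal sketch, not part of the original report: `run_import_probe` is a hypothetical helper name, it sets the `DMLC_LOG_STACK_TRACE_DEPTH` variable mentioned above (whether that variable has any effect on a Windows build is not established in this thread), and it enables Python's standard `faulthandler` so the interpreter dumps the Python stack on a fatal fault.

```python
import os
import subprocess
import sys


def run_import_probe(module_name="mxnet", stack_depth=10, timeout=300):
    """Import `module_name` in a child interpreter; return (exit code, stderr)."""
    env = dict(os.environ)
    # Variable named in the issue template; requests deeper native stack traces.
    env["DMLC_LOG_STACK_TRACE_DEPTH"] = str(stack_depth)
    # -X faulthandler asks CPython to dump the Python traceback on a fatal
    # signal (or an access violation on Windows).
    cmd = [sys.executable, "-X", "faulthandler", "-c", f"import {module_name}"]
    proc = subprocess.run(
        cmd, env=env, capture_output=True, text=True, timeout=timeout
    )
    return proc.returncode, proc.stderr
```

Calling `run_import_probe()` from the environment that runs the tests returns the child's exit code and everything it wrote to stderr, which can then be pasted into the report. Running the import in a child process also keeps a hard crash from taking down the test runner itself.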

To Reproduce

Create an AMI with the provided scripts, compile and run GPU tests.

Steps to reproduce

(Paste the commands you ran that produced the error.)

What have you tried to solve it?

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python

# paste outputs here

@larroy added the Bug label Feb 20, 2020

@vexilligera
Contributor

Using the provided setup script on a local Windows machine, I was able to run ci/windows/test_py3_gpu.ps1 successfully and did not run into this error.

@larroy
Contributor Author

larroy commented Feb 20, 2020

Ok, that's surprising. What build did you do before running the tests?

@vexilligera
Contributor

Ok, that's surprising. What build did you do before running the tests?

I simply ran pre_setup.ps1 and setup.ps1. Then I cloned MXNet, ran py -3 ci/build_windows.py -f WIN_GPU, and went into the python folder to install with python setup.py install.

I was able to import mxnet properly and the test script worked fine.

@vexilligera
Contributor

It turns out that the CUDA architecture detection probably failed. If you go with the 5.2 in the script, it does give you mxnet_70.dll on a P3 instance, but the binary might be wrong. When I set -DMXNET_CUDA_ARCH="7.0", things work properly.

Still pretty weird, since yesterday when I was trying this with Clang it gave me the same segmentation fault. I'll probably dig deeper into this if it happens again on a fresh P3 instance when I try to integrate TVM ops.
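
One possible reading of the 5.2 versus 7.0 observation, not confirmed anywhere in this thread, is CUDA's binary compatibility rule: machine code (SASS) built for compute capability X.y runs only on devices X.z with z >= y, while embedded PTX can be JIT-compiled for any device of equal or higher capability. The V100 GPUs in P3 instances are compute capability 7.0. The sketch below only encodes that rule; `parse_cc` and `can_run` are hypothetical helper names, and whether the MXNet build in question embedded PTX is an open assumption.

```python
def parse_cc(text):
    """Turn a compute-capability string such as "7.0" into (7, 0)."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def can_run(device_cc, sass_archs, ptx_archs=()):
    """Return True if code built for the given archs can run on device_cc.

    sass_archs: capabilities for which machine code was generated.
    ptx_archs:  capabilities for which PTX was embedded (assumed empty here
                unless the build is known to include it).
    """
    device = parse_cc(device_cc)
    for arch in sass_archs:
        built = parse_cc(arch)
        # SASS is compatible only within the same major version, and only
        # with devices whose minor version is equal or higher.
        if built[0] == device[0] and built[1] <= device[1]:
            return True
    # PTX can be JIT-compiled by the driver for equal or newer devices.
    return any(parse_cc(arch) <= device for arch in ptx_archs)
```

For example, `can_run("7.0", ["5.2"])` is `False`, while `can_run("7.0", ["7.0"])` and `can_run("7.0", ["5.2"], ptx_archs=["5.2"])` are both `True`. Under this rule, a binary containing only 5.2 machine code would not be expected to work on a 7.0 device.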
