Bohrium not usable via OpenMPI with missing fork support #600

Open
dionhaefner opened this issue Mar 1, 2019 · 2 comments

@dionhaefner (Collaborator)

I tried running Bohrium on multiple nodes on the cluster, but it crashes with

pclose(): No such file or directory
pclose() failed.
terminate called after throwing an instance of 'std::runtime_error'
  what():  Compiler: pclose() failed
[node170:28446] *** Process received signal ***
[node170:28446] Signal: Aborted (6)
[node170:28446] Signal code:  (-6)
[node170:28446] [ 0] /lib64/libpthread.so.0[0x323a00f7e0]
[node170:28446] [ 1] /lib64/libc.so.6(gsignal+0x35)[0x32398324f5]
[node170:28446] [ 2] /lib64/libc.so.6(abort+0x175)[0x3239833cd5]
[node170:28446] [ 3] /groups/ocean/software/clBLAS/gcc/2.12/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x15d)[0x2b10eb9fc5ad]
[node170:28446] [ 4] /groups/ocean/software/clBLAS/gcc/2.12/lib64/libstdc++.so.6(+0x8c636)[0x2b10eb9fa636]
[node170:28446] [ 5] /groups/ocean/software/clBLAS/gcc/2.12/lib64/libstdc++.so.6(+0x8c681)[0x2b10eb9fa681]
[node170:28446] [ 6] /groups/ocean/software/clBLAS/gcc/2.12/lib64/libstdc++.so.6(+0x8c898)[0x2b10eb9fa898]
[node170:28446] [ 7] /groups/ocean/software/bohrium/gcc/05102018/lib64/libbh.so(_ZNK7bohrium4jitk8Compiler7compileENSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEPKcm+0x20e)[0x2aaab2bdb0be]
[node170:28446] [ 8] /groups/ocean/software/bohrium/gcc/05102018/lib64/libbh_ve_openmp.so(_ZN7bohrium12EngineOpenMP11getFunctionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES8_+0x5f6)[0x2aaabef0e0f6]
[node170:28446] [ 9] /groups/ocean/software/bohrium/gcc/05102018/lib64/libbh_ve_openmp.so(_ZN7bohrium12EngineOpenMP7executeERKNS_4jitk11SymbolTableERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmRKSt6vectorIPK14bh_instructionSaISG_EE+0x187)[0x2aaabef0e597]
[node170:28446] [10] /groups/ocean/software/bohrium/gcc/05102018/lib64/libbh.so(_ZN7bohrium4jitk9EngineCPU15handleExecutionEP4BhIR+0x1a3c)[0x2aaab2c1ddbc]
[node170:28446] [11] /groups/ocean/software/bohrium/gcc/05102018/lib64/libbh_ve_openmp.so(+0x1b672)[0x2aaabef19672]
[node170:28446] [12] /groups/ocean/software/bohrium/gcc/05102018/lib64/libbh_vem_node.so(+0x43c6)[0x2aaabecfa3c6]
[node170:28446] [13] /groups/ocean/software/bohrium/gcc/05102018/lib64/libbhxx.so(+0x46f69)[0x2aaab2e92f69]
[node170:28446] [14] /groups/ocean/software/bohrium/gcc/05102018/lib64/libbhxx.so(_ZN4bhxx7Runtime5flushEv+0x37)[0x2aaab2e931c7]
[node170:28446] [15] /groups/ocean/software/bohrium/gcc/05102018/lib64/python2.7/site-packages/bohrium/_bh.so(PyFlush+0x9)[0x2aaab2699e69]
[node170:28446] [16] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x74a6)[0x2b10aa352b56]
[node170:28446] [17] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0xacbe)[0x2b10aa35636e]
[node170:28446] [18] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xe67)[0x2b10aa357c17]
[node170:28446] [19] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0xaa2c)[0x2b10aa3560dc]
[node170:28446] [20] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0xacbe)[0x2b10aa35636e]
[node170:28446] [21] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xe67)[0x2b10aa357c17]
[node170:28446] [22] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(+0x98cff)[0x2b10aa298cff]
[node170:28446] [23] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyObject_Call+0x47)[0x2b10aa2597c7]
[node170:28446] [24] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x1d65)[0x2b10aa34d415]
[node170:28446] [25] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xe67)[0x2b10aa357c17]
[node170:28446] [26] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(+0x98cff)[0x2b10aa298cff]
[node170:28446] [27] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyObject_Call+0x47)[0x2b10aa2597c7]
[node170:28446] [28] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x1d65)[0x2b10aa34d415]
[node170:28446] [29] /groups/ocean/software/python/gcc/2.7.14/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0xe67)[0x2b10aa357c17]
[node170:28446] *** End of error message ***

This only happens when using more than one node (multiple processes on the same node work fine), so it might be another filesystem issue (#598)?

I tried disabling the persistent cache, to no avail.

@dionhaefner (Collaborator, Author)

dionhaefner commented Mar 4, 2019

I did some digging. The process runs fine for several hundred kernels; then the JIT call to GCC fails for no apparent reason. The kernel whose compilation fails is unremarkable and varies between runs.

I noticed the following warning when starting the run:

--------------------------------------------------------------------------
A process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.

The process that invoked fork was:

  Local host:          [[14689,0],0] (PID 32590)

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.
--------------------------------------------------------------------------

It seems that fork(), and thus subprocess.Popen, is not supported in applications running through MPI. Could this be causing the problem?
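
For reference, here is a minimal reproducer sketch of what the JIT effectively does from each rank, assuming mpi4py and a system gcc are available (the script is illustrative, not part of Bohrium). Launched across two nodes with mpirun, it shows whether child processes created via fork()/exec() survive under the MPI launcher:

    # Hypothetical reproducer (assumes mpi4py); mimics Bohrium's JIT,
    # which forks a child compiler process from every MPI rank.
    import subprocess
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    for i in range(1000):
        # popen-style call: fork() + exec() of the system compiler
        proc = subprocess.Popen(
            ["gcc", "--version"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
        )
        out, err = proc.communicate()
        if proc.returncode != 0:
            print("rank %d: child process failed at iteration %d" % (rank, i))
            break

    comm.Barrier()
    if rank == 0:
        print("done")

(Note that setting the mpi_warn_on_fork MCA parameter to 0, as the warning suggests, only silences the message; per the warning text it does not make the fork() call itself any safer.)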

@dionhaefner (Collaborator, Author)

It works through MVAPICH2 instead of OpenMPI, so we can work with that. OpenMPI support would still be nice down the road, though.

dionhaefner changed the title from "Bohrium crashes when running on multiple nodes" to "Bohrium not usable via OpenMPI with missing fork support" on Mar 29, 2019.