~CuDevice cusolverDnDestroy core dump #4105
Comments
There are lots of compatibility issues to consider here.
E.g. are the CUDA toolkit version and the ABI (CXX11?) the same as the ones
PyTorch was compiled with?
Make sure Kaldi itself is rebuilt after `make clean` if you change the CUDA
version.
On Sat, Jun 13, 2020 at 8:21 PM vinda wrote:
I am using the Kaldi loss function in PyTorch via pybind and libkaldi-chain.so.
When I test the chain loss on CUDA 10.0, a core dump happens.
Nothing goes wrong with CUDA 9.0.
(gdb) bt
#0 0x00007fc11842c1de in ?? () from /usr/local/cuda/lib64/libcusolver.so.10.0
#1 0x00007fc11844e4ca in ?? () from /usr/local/cuda/lib64/libcusolver.so.10.0
#2 0x00007fc1177a9e49 in ?? () from /usr/local/cuda/lib64/libcusolver.so.10.0
#3 0x00007fc1175784c1 in cusolverDnDestroy () from /usr/local/cuda/lib64/libcusolver.so.10.0
#4 0x00007fc15f077146 in kaldi::CuDevice::~CuDevice (this=0x3c64058, __in_chrg=) at cu-device.cc:683
#5 0x00007fc1e803bad1 in (anonymous namespace)::run (p=)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/atexit_thread.cc:75
#6 0x00007fc2171a4c99 in __run_exit_handlers () from /usr/lib64/libc.so.6
#7 0x00007fc2171a4ce7 in exit () from /usr/lib64/libc.so.6
#8 0x00007fc21718d50c in __libc_start_main () from /usr/lib64/libc.so.6
#9 0x0000000000400c20 in _start ()
|
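One way to check the ABI half of this question is to print the libstdc++ dual-ABI flag from a file compiled with the same flags as Kaldi, and compare it with what the PyTorch build reports. The sketch below is only an illustration: `Cxx11AbiFlag` and `DecodeCudaVersion` are hypothetical helper names, not part of Kaldi or PyTorch. It relies on the libstdc++ macro `_GLIBCXX_USE_CXX11_ABI` and on the CUDA convention of encoding a version as 1000 * major + 10 * minor (so 10000 means 10.0).

```cpp
#include <string>   // any libstdc++ header defines _GLIBCXX_USE_CXX11_ABI
#include <utility>

// Hypothetical helper: reports the libstdc++ dual-ABI setting of this
// translation unit. Returns 1 (new CXX11 ABI), 0 (old ABI), or -1 when the
// macro is not defined (e.g. a standard library other than libstdc++).
int Cxx11AbiFlag() {
#if defined(_GLIBCXX_USE_CXX11_ABI)
  return _GLIBCXX_USE_CXX11_ABI;
#else
  return -1;
#endif
}

// Hypothetical helper: splits a CUDA-style encoded version
// (1000 * major + 10 * minor) into a {major, minor} pair.
std::pair<int, int> DecodeCudaVersion(int encoded) {
  return {encoded / 1000, (encoded % 1000) / 10};
}
```

The integer passed to `DecodeCudaVersion` would come from something like `cudaRuntimeGetVersion`. If either the ABI flag or the decoded CUDA version differs between the Kaldi build and the PyTorch build, that mismatch is a reasonable first suspect.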
@danpovey |
You probably need to make sure you use a PyTorch build that was made for
CUDA 10.
I think 1.5 comes in various flavors.
https://pytorch.org/
On Sun, Jun 14, 2020 at 5:56 PM vinda wrote:
@danpovey
Hi Povey,
Here is my environment: CentOS 7 + CUDA 10 + GCC 6 + PyTorch 1.5.
As you can see, the core dump occurs at the very last stage, in ~CuDevice()
and libcusolver.so.10.0.
I tried rebuilding Kaldi with C++14, which is the same as PyTorch 1.5, but it failed.
I also tried CentOS 7 + CUDA 10 + GCC 4.9 + PyTorch 1.2, and that succeeded, so I think
there are some compatibility issues in CUDA 10 + Kaldi + PyTorch 1.5.
I have no idea how to fix this.
|
@522, please let us know if you've got anywhere, and if so, what your best guess is as to what the problem was. |
@kkm000, I have not resolved this problem. Please see this issue in Torch. |
Looks like a case of the static initialization order fiasco (or exitialization, in this case). That's tough. Apparently, torch destroys some internal CUDA RT objects too early, so the cusolver handle can no longer be destroyed safely. What if we put a try block around this code? We are shutting down the process in this code anyway. Yes, it's a hack, but at least it will let the C runtime complete the other cleanup safely. |
I am using the Kaldi loss function in PyTorch via pybind and libkaldi-chain.so.
When I test the chain loss on CUDA 10.0, a core dump happens.
Nothing goes wrong with CUDA 9.0.
(gdb) bt
#0 0x00007fc11842c1de in ?? () from /usr/local/cuda/lib64/libcusolver.so.10.0
#1 0x00007fc11844e4ca in ?? () from /usr/local/cuda/lib64/libcusolver.so.10.0
#2 0x00007fc1177a9e49 in ?? () from /usr/local/cuda/lib64/libcusolver.so.10.0
#3 0x00007fc1175784c1 in cusolverDnDestroy () from /usr/local/cuda/lib64/libcusolver.so.10.0
#4 0x00007fc15f077146 in kaldi::CuDevice::~CuDevice (this=0x3c64058, __in_chrg=) at cu-device.cc:683
#5 0x00007fc1e803bad1 in (anonymous namespace)::run (p=)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/atexit_thread.cc:75
#6 0x00007fc2171a4c99 in __run_exit_handlers () from /usr/lib64/libc.so.6
#7 0x00007fc2171a4ce7 in exit () from /usr/lib64/libc.so.6
#8 0x00007fc21718d50c in __libc_start_main () from /usr/lib64/libc.so.6
#9 0x0000000000400c20 in _start ()