
~CuDevice cusolverDnDestroy core dump #4105

Open
522730312 opened this issue Jun 13, 2020 · 6 comments
Labels
bug, stale-exclude (Stale bot ignore this issue)

Comments

@522730312

522730312 commented Jun 13, 2020

I am using the Kaldi loss function in PyTorch via pybind and libkaldi-chain.so.
When testing chain-loss on CUDA 10.0, a core dump happens.
There is nothing wrong on CUDA 9.0.

(gdb) bt
#0 0x00007fc11842c1de in ?? () from /usr/local/cuda/lib64/libcusolver.so.10.0
#1 0x00007fc11844e4ca in ?? () from /usr/local/cuda/lib64/libcusolver.so.10.0
#2 0x00007fc1177a9e49 in ?? () from /usr/local/cuda/lib64/libcusolver.so.10.0
#3 0x00007fc1175784c1 in cusolverDnDestroy () from /usr/local/cuda/lib64/libcusolver.so.10.0
#4 0x00007fc15f077146 in kaldi::CuDevice::~CuDevice (this=0x3c64058, __in_chrg=) at cu-device.cc:683
#5 0x00007fc1e803bad1 in (anonymous namespace)::run (p=)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/atexit_thread.cc:75
#6 0x00007fc2171a4c99 in __run_exit_handlers () from /usr/lib64/libc.so.6
#7 0x00007fc2171a4ce7 in exit () from /usr/lib64/libc.so.6
#8 0x00007fc21718d50c in __libc_start_main () from /usr/lib64/libc.so.6
#9 0x0000000000400c20 in _start ()

@danpovey
Contributor

danpovey commented Jun 13, 2020 via email

@522730312
Author

522730312 commented Jun 14, 2020

@danpovey
Hi Povey,
Here is my environment: CentOS 7 + CUDA 10 + GCC 6 + PyTorch 1.5.
As you can see, the core dump occurs at the very last stage, in ~CuDevice(), inside libcusolver.so.10.0.
I tried rebuilding Kaldi with C++14 (the same standard PyTorch 1.5 uses); the build succeeds, but the chain-loss test still fails.
I tried CentOS 7 + CUDA 10 + GCC 4.9 + PyTorch 1.2, and the chain-loss test succeeds, so I think there is a compatibility issue between CUDA 10, Kaldi, and PyTorch 1.5.
I have no idea how to fix this.

@danpovey
Contributor

danpovey commented Jun 14, 2020 via email

@kkm000
Contributor

kkm000 commented Jun 22, 2020

@522, please let us know if you've gotten anywhere, and if so, what's your best guess as to what the problem was.

@522730312
Author

@kkm000, I have not resolved this problem. Please see this issue in PyTorch:
pytorch/pytorch#40672

@kkm000 kkm000 added the stale-exclude Stale bot ignore this issue label Jul 20, 2020
@kkm000
Copy link
Contributor

kkm000 commented Jul 21, 2020

Looks like a case of static initialization order fiasco (or exitialization, in this case). That's tough. Apparently, torch destroys some internal CUDA RT objects too early so that the cusolver handle can't be safely destroyed any longer.

What if we put a try block around this code? We are shutting down the process at this point anyway. Yes, it's a hack, but at least it would let the C runtime complete the rest of its cleanup safely.
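A minimal sketch of that idea as a guard in the destructor in cu-device.cc (the member name cusolverdn_handle_ is an assumption, not the verified field name, and a try block only helps if the failure surfaces as a C++ exception rather than a crash inside libcusolver):

// Hypothetical sketch, not the actual Kaldi code: guard the cuSOLVER teardown
// in ~CuDevice so a failure during process exit does not derail the remaining
// atexit handlers. The member name cusolverdn_handle_ is assumed.
CuDevice::~CuDevice() {
  if (cusolverdn_handle_ != nullptr) {
    try {
      // cusolverDnDestroy returns a status code rather than throwing;
      // log and ignore a failure instead of letting it propagate.
      cusolverStatus_t status = cusolverDnDestroy(cusolverdn_handle_);
      if (status != CUSOLVER_STATUS_SUCCESS)
        KALDI_WARN << "cusolverDnDestroy failed during shutdown; ignoring.";
    } catch (...) {
      // Only reached if the failure is a C++ exception; a segfault inside
      // libcusolver itself (as in the backtrace above) cannot be caught here.
      KALDI_WARN << "Exception destroying cuSOLVER handle at exit; ignoring.";
    }
  }
}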

@kkm000 kkm000 added bug and removed discussion labels Jul 21, 2020