
~CuDevice cusolverDnDestroy core dump #4105

Open
522730312 opened this issue Jun 13, 2020 · 6 comments
Labels
bug, stale-exclude (Stale bot ignore this issue)

Comments

@522730312

522730312 commented Jun 13, 2020

I am using the Kaldi loss function in PyTorch via pybind and libkaldi-chain.so.
When testing chain-loss on CUDA 10.0, a core dump happens.
There is nothing wrong on CUDA 9.0.

(gdb) bt
#0 0x00007fc11842c1de in ?? () from /usr/local/cuda/lib64/libcusolver.so.10.0
#1 0x00007fc11844e4ca in ?? () from /usr/local/cuda/lib64/libcusolver.so.10.0
#2 0x00007fc1177a9e49 in ?? () from /usr/local/cuda/lib64/libcusolver.so.10.0
#3 0x00007fc1175784c1 in cusolverDnDestroy () from /usr/local/cuda/lib64/libcusolver.so.10.0
#4 0x00007fc15f077146 in kaldi::CuDevice::~CuDevice (this=0x3c64058, __in_chrg=) at cu-device.cc:683
#5 0x00007fc1e803bad1 in (anonymous namespace)::run (p=)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/atexit_thread.cc:75
#6 0x00007fc2171a4c99 in __run_exit_handlers () from /usr/lib64/libc.so.6
#7 0x00007fc2171a4ce7 in exit () from /usr/lib64/libc.so.6
#8 0x00007fc21718d50c in __libc_start_main () from /usr/lib64/libc.so.6
#9 0x0000000000400c20 in _start ()

@danpovey
Contributor

danpovey commented Jun 13, 2020 via email

@522730312
Author

522730312 commented Jun 14, 2020

@danpovey
Hi Povey,
Here is my environment: CentOS 7 + CUDA 10 + GCC 6 + PyTorch 1.5.
As you can see, the core dump occurs at the very last stage, in ~CuDevice(), inside libcusolver.so.10.0.
I tried rebuilding Kaldi with C++14 (the same standard PyTorch 1.5 uses); the build succeeds, but the chain-loss test still fails.
I tried CentOS 7 + CUDA 10 + GCC 4.9 + PyTorch 1.2, and the chain-loss test succeeds, so I think there is a compatibility issue between CUDA 10, Kaldi, and PyTorch 1.5.
I have no idea how to fix this.

@danpovey
Contributor

danpovey commented Jun 14, 2020 via email

@kkm000
Contributor

kkm000 commented Jun 22, 2020

@522, please let us know if you've gotten anywhere, and if so, what's your best guess as to what the problem was.

@522730312
Author

@kkm000, I have not resolved this problem. Please see this issue in PyTorch:
pytorch/pytorch#40672

@kkm000 kkm000 added the stale-exclude Stale bot ignore this issue label Jul 20, 2020
@kkm000
Copy link
Contributor

kkm000 commented Jul 21, 2020

Looks like a case of static initialization order fiasco (or exitialization, in this case). That's tough. Apparently, torch destroys some internal CUDA RT objects too early so that the cusolver handle can't be safely destroyed any longer.

What if we put a try block around this code? We are shutting down the process at this point anyway. Yes, it's a hack, but at least it would let the C runtime complete the rest of its cleanup safely.
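A minimal sketch of that idea as a guard in the destructor in cu-device.cc (the member name cusolverdn_handle_ is an assumption, not the verified field name, and a try block only helps if the failure surfaces as a C++ exception rather than a crash inside libcusolver):

// Hypothetical sketch, not the actual Kaldi code: guard the cuSOLVER teardown
// in ~CuDevice so a failure during process exit does not derail the remaining
// atexit handlers. The member name cusolverdn_handle_ is assumed.
CuDevice::~CuDevice() {
  if (cusolverdn_handle_ != nullptr) {
    try {
      // cusolverDnDestroy returns a status code rather than throwing;
      // log and ignore a failure instead of letting it propagate.
      cusolverStatus_t status = cusolverDnDestroy(cusolverdn_handle_);
      if (status != CUSOLVER_STATUS_SUCCESS)
        KALDI_WARN << "cusolverDnDestroy failed during shutdown; ignoring.";
    } catch (...) {
      // Only reached if the failure is a C++ exception; a segfault inside
      // libcusolver itself (as in the backtrace above) cannot be caught here.
      KALDI_WARN << "Exception destroying cuSOLVER handle at exit; ignoring.";
    }
  }
}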

@kkm000 kkm000 added bug and removed discussion labels Jul 21, 2020