This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Python, Bug: Speech_recognition crashes in threaded_engine.h:379: Error: compute_ctc_loss, stat = execution failed #12024

Closed
David-Levinthal opened this issue Aug 3, 2018 · 14 comments
Labels
Backend Issues related to the backend of MXNet Bug Python

Comments

@David-Levinthal

Any insights into how to fix this would be greatly appreciated.

Description

Speech_recognition training on LibriSpeech crashes in threaded_engine.h:379: Error: compute_ctc_loss, stat = execution failed.

Environment info (Required)

Ubuntu 16.04, CUDA 9.2, cuDNN 7.1.4, NCCL 2.1.2, 4 NVIDIA V100s, CUDA_VISIBLE_DEVICES=0
deepspeech.cfg is set to use only 1 GPU

What to do:
1. Download the diagnosis script from https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py
2. Run the script using `python diagnose.py` and paste its output here.
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                28
On-line CPU(s) list:   0-27
Thread(s) per core:    1
Core(s) per socket:    14
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz
Stepping:              1
CPU MHz:               2189.890
CPU max MHz:           3500.0000
CPU min MHz:           1200.0000
BogoMIPS:              5189.99
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              35840K
NUMA node0 CPU(s):     0-13
NUMA node1 CPU(s):     14-27
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb invpcid_single intel_pt ibrs ibpb stibp kaiser tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts
----------Python Info----------
('Version      :', '2.7.12')
('Compiler     :', 'GCC 5.4.0 20160609')
('Build        :', ('default', 'Dec  4 2017 14:50:18'))
('Arch         :', ('64bit', 'ELF'))
------------Pip Info-----------
('Version      :', '9.0.1')
('Directory    :', '/usr/local/lib/python2.7/dist-packages/pip')
----------MXNet Info-----------
('Version      :', '1.3.0')
('Directory    :', '/home/levinth/mxnet/python/mxnet')
Hashtag not found. Not installed from pre-built package.
----------System Info----------
('Platform     :', 'Linux-4.4.0-130-generic-x86_64-with-Ubuntu-16.04-xenial')
('system       :', 'Linux')
('node         :', 'zt-gpu-lin')
('release      :', '4.4.0-130-generic')
('version      :', '#156-Ubuntu SMP Thu Jun 14 08:53:28 UTC 2018')
----------Hardware Info----------
('machine      :', 'x86_64')
('processor    :', 'x86_64')
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0267 sec, LOAD: 0.4810 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0077 sec, LOAD: 0.4934 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.2532 sec, LOAD: 0.2584 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0342 sec, LOAD: 0.3129 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0159 sec, LOAD: 0.1362 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0608 sec, LOAD: 1.6993 sec.

Package used (Python/R/Scala/Julia):
python

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio):
gcc/nvcc
gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609

MXNet commit hash:
git rev-parse HEAD
1fa04f2

Build config:
make -j USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 USE_NCCL=1 USE_NCCL_PATH=/usr/local/nccl
config.mk is the default except that the 2 lines for including warp-ctc are uncommented

Error Message:

[ INFO][2018/08/03 10:57:31.331] optimizer_params_dictionary = {"momentum":0.9}
[ INFO][2018/08/03 10:57:31.332] clip_gradient = 100
[ INFO][2018/08/03 10:57:31.332] weight_decay = 0.
[ INFO][2018/08/03 10:57:31.332]
[ INFO][2018/08/03 10:57:59.214] ---------train---------
[10:58:00] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:109: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
terminate called after throwing an instance of 'dmlc::Error'
what(): [10:58:03] src/engine/./threaded_engine.h:379: Error: compute_ctc_loss, stat = execution failed
A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

Stack trace returned 9 entries:
[bt] (0) /home/levinth/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f9dfe87440b]
[bt] (1) /home/levinth/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f9dfe874f78]
[bt] (2) /home/levinth/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xfa9) [0x7f9e015b3a59]
[bt] (3) /home/levinth/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f9e015c9d0b]
[bt] (4) /home/levinth/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f9e015c9f7e]
[bt] (5) /home/levinth/mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f9e015b299a]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f9e4e10dc80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f9e581196ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f9e57e4f41d]
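
The hint in the error message (run with MXNET_ENGINE_TYPE set to NaiveEngine, then unset it afterwards) can be followed without touching the shell environment by setting the variable only for a child process. A minimal sketch, assuming the training script is started as a separate command; the helper names `naive_engine_env` and `run_with_naive_engine` are hypothetical and are not part of MXNet or the deepspeech example. The two environment variable names are taken from the log above.

```python
import os
import subprocess


def naive_engine_env(base_env=None):
    """Return a copy of the environment with MXNet's synchronous engine selected."""
    env = dict(os.environ if base_env is None else base_env)
    # Named in the error message above: forces all operations to be synchronous
    # so a backtrace points at the call that actually failed.
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    # Named in the cuDNN log line above: disables the convolution autotune pass.
    env["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0"
    return env


def run_with_naive_engine(command):
    """Run `command` (a list of arguments) in a child process with the debug settings.

    Only the child sees the variables, so nothing has to be reset afterwards.
    """
    return subprocess.call(command, env=naive_engine_env())
```

Calling `run_with_naive_engine([...])` with the usual training command line (optionally prefixed with `gdb --args`) returns the child's exit code; the parent shell keeps its default engine setting.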

Minimum reproducible example

Default deepspeech example with the LibriSpeech data set prepared per the instructions.
train.py was edited to use:

#summary_writer = SummaryWriter(tblog_dir)
summary_writer = tf.summary.FileWriter(tblog_dir)

Steps to reproduce

See the attachment for installation and invocation notes:
mxnet_deepspeech_build_for_bug.txt

@Roshrini
Member

Roshrini commented Aug 6, 2018

@anirudh2290 Can you please add labels: Python, Bug

@David-Levinthal David-Levinthal changed the title Speech_recognition crashes in threaded_engine.h:379: Error: compute_ctc_loss, stat = execution failed Python, Bug: Speech_recognition crashes in threaded_engine.h:379: Error: compute_ctc_loss, stat = execution failed Aug 6, 2018
@David-Levinthal
Author

David-Levinthal commented Aug 6, 2018 via email

@nswamy nswamy added the Backend Issues related to the backend of MXNet label Aug 7, 2018
@vandanavk
Contributor

I am in the process of replicating the issue and haven't yet reached the failure that you are seeing.

In the meantime, I have a couple of questions to better understand the issue and the replication steps:
Why use tf.summary.FileWriter instead of SummaryWriter?
Also, have you tried mxboard's FileWriter/SummaryWriter in this example?
Any input on this would be helpful.
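
For reference, trying mxboard in place of tf.summary.FileWriter might look roughly like the sketch below. It is not the change that was eventually made to the example. It assumes only that the writer object exposes `add_scalar(tag, value, global_step)`, which mxboard's `SummaryWriter` does; the helper name `log_scalars` and the `"loss"` tag are hypothetical.

```python
def log_scalars(writer, step, scalars):
    """Write each (tag, value) pair in `scalars` to `writer` at the given step."""
    for tag in sorted(scalars):
        writer.add_scalar(tag=tag, value=scalars[tag], global_step=step)


# Usage with mxboard (requires `pip install mxboard`); `tblog_dir` is the
# log directory variable that train.py already passes to its writer:
#
#     from mxboard import SummaryWriter
#     summary_writer = SummaryWriter(logdir=tblog_dir)
#     log_scalars(summary_writer, step=0, scalars={"loss": 1.0})
#     summary_writer.close()
```

Keeping the logging call behind a small function like this means the writer can be swapped without touching the training loop.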

@vandanavk
Contributor

@David-Levinthal have you had a chance to try out the solutions listed in #6121, #7002 or #6603?

@David-Levinthal
Author

David-Levinthal commented Aug 17, 2018 via email

@vandanavk
Contributor

Update: I don't see the issue with mxboard.

Changes required (vandanavk@d1bc989)

It's been executing for more than 1 hour now.

[    INFO][2018/08/18 00:37:36.313] clip_gradient = 100
[    INFO][2018/08/18 00:37:36.313] weight_decay = 0.
[    INFO][2018/08/18 00:37:36.313] 
[    INFO][2018/08/18 00:38:00.737] ---------train---------
[00:38:02] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:109: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[    INFO][2018/08/18 00:52:28.763] label: and be like the self sacrificing heroines she loved to act 
[    INFO][2018/08/18 00:52:28.763] pred :       , cer: 0.864865 (distance: 32/ label length: 37)
[    INFO][2018/08/18 00:52:28.764] label: ain't that something done after you've done all that 
[    INFO][2018/08/18 00:52:28.765] pred :      , cer: 0.875000 (distance: 28/ label length: 32)
[    INFO][2018/08/18 00:52:28.766] label: so said the captain in a voice so stern it made joe wince 
[    INFO][2018/08/18 00:52:28.766] pred :     , cer: 0.921053 (distance: 35/ label length: 38)
[    INFO][2018/08/18 00:52:28.767] label: that my words seem to you utterly unnecessary and out of place 
[    INFO][2018/08/18 00:52:28.767] pred :       , cer: 0.875000 (distance: 35/ label length: 40)
[    INFO][2018/08/18 00:52:28.768] label: who dethroned his father you are welcome brave jason 
[    INFO][2018/08/18 00:52:28.768] pred :       , cer: 0.852941 (distance: 29/ label length: 34)
[    INFO][2018/08/18 00:52:28.770] label: several times that day as he perceived coulson's jealous sullenness 
[    INFO][2018/08/18 00:52:28.770] pred :         , cer: 0.829268 (distance: 34/ label length: 41)
[    INFO][2018/08/18 00:52:28.771] label: but she turned it off with assumed lightness oh yes 
[    INFO][2018/08/18 00:52:28.771] pred :       , cer: 0.848485 (distance: 28/ label length: 33)
[    INFO][2018/08/18 00:52:28.772] label: missus ludlow's mental motions were sufficiently various 
[    INFO][2018/08/18 00:52:28.772] pred :       , cer: 0.843750 (distance: 27/ label length: 32)
[    INFO][2018/08/18 00:52:28.774] label: the children said she did not shed one tear but prayed all the while 
[    INFO][2018/08/18 00:52:28.774] pred :       , cer: 0.888889 (distance: 40/ label length: 45)
[    INFO][2018/08/18 00:52:28.775] label: many that did ill under physicians hands have happily escaped 
[    INFO][2018/08/18 00:52:28.775] pred :      , cer: 0.894737 (distance: 34/ label length: 38)
[    INFO][2018/08/18 00:52:28.776] label: while his face assumed a hard determined expression 
[    INFO][2018/08/18 00:52:28.776] pred :       , cer: 0.838710 (distance: 26/ label length: 31)
[    INFO][2018/08/18 00:52:28.778] label: and white and furnished with light and heat hot or cold water if desired 
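
The `cer` values in this log are consistent with edit distance divided by label length (for example 32/37 and 28/32). A minimal sketch of that arithmetic, assuming a plain Levenshtein distance over token sequences; how the example tokenizes labels and predictions is not shown in this thread, so the functions below take already-tokenized sequences, and the names `edit_distance` and `cer` are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token sequences (strings or lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_token != hyp_token)
            current.append(min(previous[j] + 1,      # delete from reference
                               current[j - 1] + 1,   # insert into reference
                               substitution))        # substitute or match
        previous = current
    return previous[-1]


def cer(distance, label_length):
    """Error rate as reported in the log: distance / label length."""
    return float(distance) / label_length


# Reproduces the first two log entries above.
print("%.6f" % cer(32, 37))  # 0.864865
print("%.6f" % cer(28, 32))  # 0.875000
```

Because the early predictions are almost empty, the distance is close to the label length and the rate stays near 1.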

@David-Levinthal
Author

David-Levinthal commented Aug 18, 2018 via email

@vandanavk
Contributor

PR #12291 submitted

@vandanavk
Contributor

@David-Levinthal The changes in the PR have been approved. Please verify if the changes work at your end too.

@David-Levinthal
Author

David-Levinthal commented Aug 28, 2018 via email

@vandanavk
Contributor

@David-Levinthal just wanted to follow up and check whether you still see this issue.

@David-Levinthal
Author

David-Levinthal commented Sep 25, 2018 via email

@vrakesh
Contributor

vrakesh commented Oct 8, 2018

@David-Levinthal
Requesting an update from your side. If the issue is fixed, we can close it.

@vrakesh
Contributor

vrakesh commented Dec 26, 2018

@David-Levinthal since the issue has been resolved in a PR, I am requesting committers to close it. Feel free to reopen it if the error persists on your side.

@sandeep-krishnamurthy Requesting to close this issue since it has been resolved in a PR.
