Python, Bug: Speech_recognition crashes in threaded_engine.h:379: Error: compute_ctc_loss, stat = execution failed #12024
Comments
@anirudh2290 Can you please add labels: Python, Bug
done
I am in the process of replicating the issue and haven't yet reached the failure that you are seeing. In the meantime, I have a couple of questions to better understand the issue and the replication steps.
@David-Levinthal have you had a chance to try out the solutions listed in #6121, #7002 or #6603?
SummaryWriter is deprecated; use an R1.9 TF installation for the source.
I used TF because I have source builds.
If mxboard is required, perhaps a minor change to the README's installation instructions would be a solution.
d
On Thu, Aug 16, 2018 at 11:42 PM, Vandana Kannan wrote:
Why use tf.summary.FileWriter instead of SummaryWriter?
Also, have you tried mxboard's FileWriter/SummaryWriter in this example?
Any inputs on this would be helpful.
Update: I don't see the issue with mxboard. Changes required: vandanavk@d1bc989. It's been executing for more than 1 hour now.
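For readers following along, swapping TensorBoard's writer for mxboard's might look roughly like the sketch below. This is not the content of the linked commit: the helper name `log_scalars`, the `loss` tag, and the temporary log directory are illustrative assumptions, and the import is guarded so the snippet still runs when mxboard is not installed.

```python
import tempfile

try:
    # mxboard provides a TensorBoard-compatible SummaryWriter for MXNet.
    from mxboard import SummaryWriter
except (ImportError, OSError):  # mxboard (or its mxnet dependency) is unavailable
    SummaryWriter = None


def log_scalars(logdir, values, tag="loss"):
    """Write one scalar per step to `logdir`; return False if mxboard is missing.

    `log_scalars` is a hypothetical helper, not part of the speech_recognition example.
    """
    if SummaryWriter is None:
        return False
    sw = SummaryWriter(logdir=logdir)
    try:
        for step, value in enumerate(values):
            sw.add_scalar(tag=tag, value=value, global_step=step)
    finally:
        sw.close()  # flush pending events to disk
    return True


if __name__ == "__main__":
    # Made-up values; a real run would log the training loss or CER per batch.
    print(log_scalars(tempfile.mkdtemp(), [0.92, 0.88, 0.86]))
```

The resulting event files can then be viewed with TensorBoard pointed at the same log directory.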
As has been stated, the issue is with using tensorflow/tensorboard.
If mxboard is required, then the instructions need to say so.
On Fri, Aug 17, 2018 at 6:48 PM, Vandana Kannan wrote:
It's been executing for more than 1 hour now.
[ INFO][2018/08/18 00:37:36.313] clip_gradient = 100
[ INFO][2018/08/18 00:37:36.313] weight_decay = 0.
[ INFO][2018/08/18 00:37:36.313]
[ INFO][2018/08/18 00:38:00.737] ---------train---------
[00:38:02] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:109: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
[ INFO][2018/08/18 00:52:28.763] label: and be like the self sacrificing heroines she loved to act
[ INFO][2018/08/18 00:52:28.763] pred : , cer: 0.864865 (distance: 32/ label length: 37)
[ INFO][2018/08/18 00:52:28.764] label: ain't that something done after you've done all that
[ INFO][2018/08/18 00:52:28.765] pred : , cer: 0.875000 (distance: 28/ label length: 32)
[ INFO][2018/08/18 00:52:28.766] label: so said the captain in a voice so stern it made joe wince
[ INFO][2018/08/18 00:52:28.766] pred : , cer: 0.921053 (distance: 35/ label length: 38)
[ INFO][2018/08/18 00:52:28.767] label: that my words seem to you utterly unnecessary and out of place
[ INFO][2018/08/18 00:52:28.767] pred : , cer: 0.875000 (distance: 35/ label length: 40)
[ INFO][2018/08/18 00:52:28.768] label: who dethroned his father you are welcome brave jason
[ INFO][2018/08/18 00:52:28.768] pred : , cer: 0.852941 (distance: 29/ label length: 34)
[ INFO][2018/08/18 00:52:28.770] label: several times that day as he perceived coulson's jealous sullenness
[ INFO][2018/08/18 00:52:28.770] pred : , cer: 0.829268 (distance: 34/ label length: 41)
[ INFO][2018/08/18 00:52:28.771] label: but she turned it off with assumed lightness oh yes
[ INFO][2018/08/18 00:52:28.771] pred : , cer: 0.848485 (distance: 28/ label length: 33)
[ INFO][2018/08/18 00:52:28.772] label: missus ludlow's mental motions were sufficiently various
[ INFO][2018/08/18 00:52:28.772] pred : , cer: 0.843750 (distance: 27/ label length: 32)
[ INFO][2018/08/18 00:52:28.774] label: the children said she did not shed one tear but prayed all the while
[ INFO][2018/08/18 00:52:28.774] pred : , cer: 0.888889 (distance: 40/ label length: 45)
[ INFO][2018/08/18 00:52:28.775] label: many that did ill under physicians hands have happily escaped
[ INFO][2018/08/18 00:52:28.775] pred : , cer: 0.894737 (distance: 34/ label length: 38)
[ INFO][2018/08/18 00:52:28.776] label: while his face assumed a hard determined expression
[ INFO][2018/08/18 00:52:28.776] pred : , cer: 0.838710 (distance: 26/ label length: 31)
[ INFO][2018/08/18 00:52:28.778] label: and white and furnished with light and heat hot or cold water if desired
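The `cer` values in the log above are consistent with the reported distance divided by the label length (for example, 32/37 ≈ 0.864865). A minimal sketch of that calculation follows. The function names are illustrative, and the log does not show how the example tokenizes labels (which determines the label length), so the inputs here are treated as plain sequences of symbols.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: insertions, deletions, and substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a symbol from ref
                            curr[j - 1] + 1,      # insert a symbol into ref
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance normalized by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference label must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Reproduce the ratio from the first log entry using its reported counts.
    print(round(32 / 37, 6))
    print(cer("kitten", "sitting"))
```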
PR #12291 submitted
@David-Levinthal The changes in the PR have been approved. Please verify if the changes work at your end too.
This may take a few days; my lab was moved to a new building.
What fun! :-(
@David-Levinthal just wanted to follow up to check whether you still see this issue.
Sorry, I'm seriously overwhelmed. Know a good cloning lab?
d
@David-Levinthal since the issue has been resolved in a PR, I am requesting committers to close it. Feel free to reopen it if the error persists on your side. @sandeep-krishnamurthy Requesting to close this issue since it has been resolved in a PR.
Any insights into how to fix this would be greatly appreciated.
Description
Speech_recognition training on LibriSpeech crashes in threaded_engine.h:379: Error: compute_ctc_loss, stat = execution failed.
Environment info (Required)
Ubuntu 16.04, CUDA 9.2, cuDNN 7.1.4, NCCL 2.1.2, 4 Nvidia V100s, CUDA_VISIBLE_DEVICES=0
deepspeech.cfg set to only use 1 GPU
Package used (Python/R/Scala/Julia):
python
Build info (Required if built from source)
Compiler (gcc/clang/mingw/visual studio):
gcc/nvcc
gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.10) 5.4.0 20160609
MXNet commit hash:
git rev-parse HEAD
1fa04f2
Build config:
make -j USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 USE_NCCL=1 USE_NCCL_PATH=/usr/local/nccl
config.mk is default except for uncommenting the 2 lines for including warp-ctc
Error Message:
[ INFO][2018/08/03 10:57:31.331] optimizer_params_dictionary = {"momentum":0.9}
[ INFO][2018/08/03 10:57:31.332] clip_gradient = 100
[ INFO][2018/08/03 10:57:31.332] weight_decay = 0.
[ INFO][2018/08/03 10:57:31.332]
[ INFO][2018/08/03 10:57:59.214] ---------train---------
[10:58:00] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:109: Running performance tests to find the best convolution algorithm, this can take a while... (setting env variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
terminate called after throwing an instance of 'dmlc::Error'
what(): [10:58:03] src/engine/./threaded_engine.h:379: Error: compute_ctc_loss, stat = execution failed
A fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
Stack trace returned 9 entries:
[bt] (0) /home/levinth/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace[abi:cxx11]()+0x5b) [0x7f9dfe87440b]
[bt] (1) /home/levinth/mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x28) [0x7f9dfe874f78]
[bt] (2) /home/levinth/mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*)+0xfa9) [0x7f9e015b3a59]
[bt] (3) /home/levinth/mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>*, std::shared_ptr<dmlc::ManualEvent> const&)+0xeb) [0x7f9e015c9d0b]
[bt] (4) /home/levinth/mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler<void (std::shared_ptr<dmlc::ManualEvent>), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#4}::operator()() const::{lambda(std::shared_ptr<dmlc::ManualEvent>)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr<dmlc::ManualEvent>&&)+0x4e) [0x7f9e015c9f7e]
[bt] (5) /home/levinth/mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl<std::_Bind_simple<std::function<void (std::shared_ptr<dmlc::ManualEvent>)> (std::shared_ptr<dmlc::ManualEvent>)> >::_M_run()+0x4a) [0x7f9e015b299a]
[bt] (6) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f9e4e10dc80]
[bt] (7) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba) [0x7f9e581196ba]
[bt] (8) /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f9e57e4f41d]
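Following the advice in the error message above, one way to rerun training synchronously is to set `MXNET_ENGINE_TYPE=NaiveEngine` for a single child process only, so the variable does not linger after debugging. The sketch below just builds that environment. The helper name `debug_env` is hypothetical, and the training command in the comment is a placeholder rather than a command taken from this issue. It also sets `MXNET_CUDNN_AUTOTUNE_DEFAULT=0`, which the log line suggests for skipping the cuDNN performance tests.

```python
import os


def debug_env(base=None):
    """Return a copy of the environment with MXNet's synchronous engine enabled.

    `debug_env` is an illustrative helper, not part of MXNet.
    """
    env = dict(os.environ if base is None else base)
    # Force every operation to run synchronously so a backtrace points at the
    # failing operator instead of the asynchronous engine worker.
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    # Skip the cuDNN convolution autotuning pass mentioned in the log.
    env["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0"
    return env


if __name__ == "__main__":
    env = debug_env()
    print(env["MXNET_ENGINE_TYPE"], env["MXNET_CUDNN_AUTOTUNE_DEFAULT"])
    # Placeholder invocation; substitute the actual training command:
    # import subprocess
    # subprocess.run(["python", "<training script>", "<args>"], env=env, check=True)
```

Because the variables are set only in the copied dictionary, the parent shell's environment is unchanged, which matches the reminder to reset `MXNET_ENGINE_TYPE` after debugging.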
Minimum reproducible example
default deepspeech with LibriSpeech data set prepared per instructions
train.py edited to use
Steps to reproduce
see attachment for installation and invocation notes
mxnet_deepspeech_build_for_bug.txt