Memory issue on Windows #1474
There is a similar discussion here
#1289 (comment)
(but it was another Turkish guy, maybe the same team?)
My take-away at the time was that it seemed to be either an issue of the
BLAS library, or some problem deep inside Windows or its debugging tools,
because the code that was involved was not doing anything that could
possibly cause a memory leak.
You say that there is an error on Ubuntu 16.04. I'd be interested in
investigating that more, as that's more likely to be a problem in Kaldi
itself, and at the very least will be easier to debug. Can you run that
test setup in valgrind?
Dan
On Mon, Mar 6, 2017 at 1:48 AM, yusuf-gunaydin wrote:
We had been using Nnet2 models for some time with no issues. We recently
updated our systems to use Nnet3 models, and after the update our
servers started to show memory leaks.
The problem is not an actual memory leak. I have checked the code with
Valgrind on Linux and with DrMemory and VisualLeakDetector on Windows. None
of them reports any leak. The problem reveals itself only if there are
multiple recognition threads. (I have tried on an 8-core machine with 8
threads and on a 32-core machine with 24 threads.)
Things I have tried, comparing multiple environments to identify the issue:
- Platforms:
  - CentOS 6.6 - No leak.
  - Windows Server 2012 - Leak.
  - Windows 10 - Leak.
  - MinGW on Windows 10 - Leak.
  - Cygwin on Windows 10 - A strange divide by zero error.
  - Ubuntu 16.04 - A strange divide by zero error.
- Architecture:
  - x64 - Leak.
  - x86 - Leak.
- Compilers:
  - Visual Studio 2012 - Leak.
  - Visual Studio 2015 - Leak.
  - Intel Compiler included in Parallel Studio XE 2015 - Leak.
- CBLAS Libraries:
  - Intel MKL 2015 - Leak.
  - Intel MKL 2017 - Leak.
  - CLAPACK - Leak.
After a month of struggling with this problem, I am starting to think this
is a problem with Windows's memory management. I have spent too much time
on this issue, and I wanted to ask whether the problem has occurred to
anybody else.
I have minimized the problem to the following code, and the nnet model I
have been using is here
<https://github.com/kaldi-asr/kaldi/files/820326/nnet.zip>:
#include <cstdlib>   // rand
#include <iostream>
#include <string>
#include <thread>
#include <vector>
#include "util/kaldi-io.h"
#include "nnet3/nnet-nnet.h"
#include "nnet3/nnet-computation.h"
#include "nnet3/nnet-computation-graph.h"

int threadCount = 8;

int main(int argc, char * argv[])
{
    std::cout << "Press enter to start." << std::endl;
    std::cin.get();

    std::vector<std::thread> threads;
    for (int i = 0; i < threadCount; i++)
    {
        // Each thread loads its own copy of the model and then repeatedly
        // builds a computation graph for a random request.
        threads.emplace_back([]()
        {
            bool isBinary;
            kaldi::Input kaldiInput("nnet.bin", &isBinary);
            kaldi::nnet3::Nnet nnet;
            nnet.Read(kaldiInput.Stream(), isBinary);

            for (int j = 0; j < 10000; j++)
            {
                for (int k = 0; k < 10000; k++)
                {
                    int input_end = 0 + 100;
                    kaldi::nnet3::IoSpecification input;
                    input.name = "input";
                    kaldi::nnet3::IoSpecification output;
                    output.name = "output";
                    int n = rand() % 10;
                    // in the IoSpecification for now we will request all the same indexes at
                    // output that we requested at input.
                    for (int t = 0; t < input_end; t++) {
                        input.indexes.push_back(kaldi::nnet3::Index(n, t));
                        output.indexes.push_back(kaldi::nnet3::Index(n, t));
                    }
                    kaldi::nnet3::ComputationRequest request;
                    request.inputs.push_back(input);
                    request.outputs.push_back(output);
                    // The graph and its builder are created and destroyed on every iteration.
                    kaldi::nnet3::ComputationGraph graph;
                    kaldi::nnet3::ComputationGraphBuilder builder(nnet, request, &graph);
                    builder.Compute();
                }
            }
        });
    }

    // Wait for all worker threads to finish.
    for (auto & th : threads)
    {
        th.join();
    }
    std::cout << "Done." << std::endl;
    std::cin.get();
    return 0;
}
|
Try running it in valgrind; it could be that there was an earlier error,
like memory overwriting, that caused the actual error. (valgrind
[program] [args])
On Mon, Mar 6, 2017 at 11:56 PM, yusuf-gunaydin wrote:
Yes, the same team, but the problem is different. We solved that issue
by disabling multithreading on ivector extraction.
You can see the divide-by-zero issue by compiling the code I sent in the
first post on Ubuntu 16.04 with gcc 5.4.0 (4.8 and 4.9 also give me the
same issue on Ubuntu 14.04).
I compiled the code with this command:
g++ -std=c++11 test.cpp
-Ikaldi/src/ -Ikaldi/tools/openfst/include -DHAVE_OPENBLAS
-Ikaldi/tools/OpenBLAS/install/include -Lkaldi/tools/OpenBLAS/install/lib
-Lkaldi/tools/openfst/lib -Lkaldi/src/util/ -Lkaldi/src/nnet3/ -lopenblas
-lfst -lpthread -lkaldi-util -lkaldi-nnet3
I am copying the gdb output on my machine:
Thread 3 "a.out" received signal SIGFPE, Arithmetic exception.
[Switching to Thread 0x7fffeaeff700 (LWP 30603)]
0x00007ffff74a1d03 in std::tr1::__detail::_Mod_range_hashing::operator() (
this=0x7fffeaefec8b, __num=9714, __den=0)
at /usr/include/c++/5/tr1/hashtable_policy.h:369
369 { return __num % __den; }
With backtrace:
#0 0x00007ffff74a1d03 in std::tr1::__detail::_Mod_range_hashing::operator() (
this=0x7fffeaefec8b, __num=9714, __den=0)
at /usr/include/c++/5/tr1/hashtable_policy.h:369
#1 0x00007ffff75480b1 in std::tr1::__detail::_Hash_code_base<std::pair<int, kaldi::nnet3::Index>, std::pair<std::pair<int, kaldi::nnet3::Index> const, int>, std::_Select1st<std::pair<std::pair<int, kaldi::nnet3::Index> const, int> >, std::equal_to<std::pair<int, kaldi::nnet3::Index> >, kaldi::nnet3::CindexHasher, std::tr1::__detail::_Mod_range_hashing, std::tr1::__detail::_Default_ranged_hash, false>::_M_bucket_index (this=0x7fffeaefec88, __c=9714, __n=0)
at /usr/include/c++/5/tr1/hashtable_policy.h:677
#2 0x00007ffff7547c78 in std::tr1::_Hashtable<std::pair<int, kaldi::nnet3::Index>, std::pair<std::pair<int, kaldi::nnet3::Index> const, int>, std::allocator<std::pair<std::pair<int, kaldi::nnet3::Index> const, int> >, std::_Select1st<std::pair<std::pair<int, kaldi::nnet3::Index> const, int> >, std::equal_to<std::pair<int, kaldi::nnet3::Index> >, kaldi::nnet3::CindexHasher, std::tr1::__detail::_Mod_range_hashing, std::tr1::__detail::_Default_ranged_hash, std::tr1::__detail::_Prime_rehash_policy, false, false, true>::_M_insert (this=0x7fffeaefec88,
__v=...) at /usr/include/c++/5/tr1/hashtable.h:893
#3 0x00007ffff7545f4a in std::tr1::_Hashtable<std::pair<int, kaldi::nnet3::Index>, std::pair<std::pair<int, kaldi::nnet3::Index> const, int>, std::allocator<std::pair<std::pair<int, kaldi::nnet3::Index> const, int> >, std::_Select1st<std::pair<std::pair<int, kaldi::nnet3::Index> const, int> >, std::equal_to<std::pair<int, kaldi::nnet3::Index> >, kaldi::nnet3::CindexHasher, std::tr1::__detail::_Mod_range_hashing, std::tr1::__detail::_Default_ranged_hash, std::tr1::__detail::_Prime_rehash_policy, false, false, true>::insert (this=0x7fffeaefec88, __v=...)
at /usr/include/c++/5/tr1/hashtable.h:376
#4 0x00007ffff753bea1 in kaldi::nnet3::ComputationGraph::GetCindexId (
this=0x7fffeaefec30, cindex=..., input=true, is_new=0x7fffeaefe766)
at nnet-computation-graph.cc:33
#5 0x00007ffff753d5f7 in kaldi::nnet3::ComputationGraphBuilder::AddInputs (
this=0x7fffeaefecc0) at nnet-computation-graph.cc:244
#6 0x00007ffff753e902 in kaldi::nnet3::ComputationGraphBuilder::Compute (
this=0x7fffeaefecc0) at nnet-computation-graph.cc:434
#7 0x000000000040b003 in main::{lambda()#1}::operator()() const ()
#8 0x000000000040c650 in void std::_Bind_simple<main::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>) ()
#9 0x000000000040c5a6 in std::_Bind_simple<main::{lambda()#1} ()>::operator()() ()
#10 0x000000000040c536 in std::thread::_Impl<std::_Bind_simple<main::{lambda()#1} ()> >::_M_run() ()
#11 0x00007ffff6f37c80 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#12 0x00007ffff7bc16ba in start_thread (arg=0x7fffeaeff700)
at pthread_create.c:333
#13 0x00007ffff669d82d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
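For readers following the backtrace: frame #0 is the bucket-index computation of the tr1 hash table, `__num % __den`, and frame #1 shows that `__den` is the table's bucket count (`__n=0`). A bucket count of zero turns this into an integer division by zero, which is what the SIGFPE reports. The sketch below reproduces only that arithmetic with an explicit guard; the name `SafeBucketIndex` is hypothetical and is not part of libstdc++ or Kaldi.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper mirroring the arithmetic in frame #0:
//   bucket = hash % bucket_count
// Instead of dividing by zero, it reports failure when the bucket count is 0.
inline bool SafeBucketIndex(std::size_t hash, std::size_t bucket_count,
                            std::size_t *bucket) {
  assert(bucket != nullptr);
  if (bucket_count == 0) {
    // This is the state seen in the backtrace (__den=0); a correctly
    // constructed table should not normally reach it on insert.
    return false;
  }
  *bucket = hash % bucket_count;
  return true;
}
```

A zero bucket count during `insert` therefore suggests that the map object itself is in an invalid state, rather than that the hash value (9714 here) is at fault.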
|
It was killed due to OOM before it could reach any errors. Rerun with
fewer threads.
On Tue, Mar 7, 2017 at 12:36 AM, yusuf-gunaydin wrote:
I have checked with valgrind. The output is below. Seems there is no new
information.
==31982== Memcheck, a memory error detector
==31982== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==31982== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==31982== Command: ./a.out
==31982==
Press enter to start.
==31982==
==31982== Process terminating with default action of signal 8 (SIGFPE)
==31982== Integer divide by zero at address 0x8034464EF
==31982== at 0x557BD03: std::tr1::__detail::_Mod_range_hashing::operator()(unsigned long, unsigned long) const (hashtable_policy.h:369)
==31982== by 0x56220B0: std::tr1::__detail::_Hash_code_base<std::pair<int, kaldi::nnet3::Index>, std::pair<std::pair<int, kaldi::nnet3::Index> const, int>, std::_Select1st<std::pair<std::pair<int, kaldi::nnet3::Index> const, int> >, std::equal_to<std::pair<int, kaldi::nnet3::Index> >, kaldi::nnet3::CindexHasher, std::tr1::__detail::_Mod_range_hashing, std::tr1::__detail::_Default_ranged_hash, false>::_M_bucket_index(std::pair<int, kaldi::nnet3::Index> const&, unsigned long, unsigned long) const (hashtable_policy.h:677)
==31982== by 0x5621C77: std::tr1::_Hashtable<std::pair<int, kaldi::nnet3::Index>, std::pair<std::pair<int, kaldi::nnet3::Index> const, int>, std::allocator<std::pair<std::pair<int, kaldi::nnet3::Index> const, int> >, std::_Select1st<std::pair<std::pair<int, kaldi::nnet3::Index> const, int> >, std::equal_to<std::pair<int, kaldi::nnet3::Index> >, kaldi::nnet3::CindexHasher, std::tr1::__detail::_Mod_range_hashing, std::tr1::__detail::_Default_ranged_hash, std::tr1::__detail::_Prime_rehash_policy, false, false, true>::_M_insert(std::pair<std::pair<int, kaldi::nnet3::Index> const, int> const&, std::tr1::integral_constant<bool, true>) (hashtable.h:893)
==31982== by 0x561FF49: std::tr1::_Hashtable<std::pair<int, kaldi::nnet3::Index>, std::pair<std::pair<int, kaldi::nnet3::Index> const, int>, std::allocator<std::pair<std::pair<int, kaldi::nnet3::Index> const, int> >, std::_Select1st<std::pair<std::pair<int, kaldi::nnet3::Index> const, int> >, std::equal_to<std::pair<int, kaldi::nnet3::Index> >, kaldi::nnet3::CindexHasher, std::tr1::__detail::_Mod_range_hashing, std::tr1::__detail::_Default_ranged_hash, std::tr1::__detail::_Prime_rehash_policy, false, false, true>::insert(std::pair<std::pair<int, kaldi::nnet3::Index> const, int> const&) (hashtable.h:376)
==31982== by 0x5615EA0: kaldi::nnet3::ComputationGraph::GetCindexId(std::pair<int, kaldi::nnet3::Index> const&, bool, bool*) (nnet-computation-graph.cc:33)
==31982== by 0x56175F6: kaldi::nnet3::ComputationGraphBuilder::AddInputs() (nnet-computation-graph.cc:244)
==31982== by 0x5618901: kaldi::nnet3::ComputationGraphBuilder::Compute() (nnet-computation-graph.cc:434)
==31982== by 0x40B002: main::{lambda()#1}::operator()() const (in Desktop/a.out)
==31982== by 0x40C64F: void std::_Bind_simple<main::{lambda()#1} ()>::_M_invoke<>(std::_Index_tuple<>) (in Desktop/a.out)
==31982== by 0x40C5A5: std::_Bind_simple<main::{lambda()#1} ()>::operator()() (in Desktop/a.out)
==31982== by 0x40C535: std::thread::_Impl<std::_Bind_simple<main::{lambda()#1} ()> >::_M_run() (in Desktop/a.out)
==31982== by 0x5AC8C7F: ??? (in /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.21)
==31982==
==31982== HEAP SUMMARY:
==31982== in use at exit: 13,218,072 bytes in 6,044 blocks
==31982== total heap usage: 8,476 allocs, 2,432 frees, 13,457,566 bytes allocated
==31982==
==31982== LEAK SUMMARY:
==31982== definitely lost: 0 bytes in 0 blocks
==31982== indirectly lost: 0 bytes in 0 blocks
==31982== possibly lost: 6,912 bytes in 24 blocks
==31982== still reachable: 13,211,160 bytes in 6,020 blocks
==31982== suppressed: 0 bytes in 0 blocks
==31982== Rerun with --leak-check=full to see details of leaked memory
==31982==
==31982== For counts of detected and suppressed errors, rerun with: -v
==31982== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Killed
|
Sorry, I see that it did reach the error.
It's very strange that valgrind did not report any errors there and it's
still crashing. After how many iterations does it fail, typically? Also
see whether the memory locations printed are consistent (unlikely), or
change each time.
|
Hm. What I suspect is that the cindex_to_cindex_id_ object has not been
properly initialized, either because the constructor was not called or
because the stack or heap variable was overwritten by a bad memory access.
[But if it were a bad memory access, normally valgrind would show it up...
if you initialize as many variables as possible via 'new', on the heap, it
might be helpful, because it could be that valgrind has a hard time
detecting invalid accesses on the stack].
It could possibly be a compiler error; perhaps the support for lambdas is
buggy. Maybe try declaring a function in the normal way, without lambdas,
and use that to initialize the thread.
I would recommend updating to the latest Kaldi code to see if the error
persists, because it's possible it was a bug that was fixed some time
ago. I did notice your line numbers are not the same as in the current code.
Dan
On Wed, Mar 8, 2017 at 12:02 AM, yusuf-gunaydin wrote:
The crash occurs in a single thread, too. Actually, the exception is thrown
when the program first calls the function
int32 ComputationGraph::GetCindexId(const Cindex &cindex, bool input, bool *is_new).
The values at the time of the crash are:
j: 0
k: 0
new_index: 0
cindex: (0 (3 0 0))
The strange thing is that if I run the code below in a separate program, it
does not throw:
typedef unordered_map<kaldi::nnet3::Cindex, int32, kaldi::nnet3::CindexHasher> map_type;
map_type cindex_to_cindex_id_;
int new_index = cindex_to_cindex_id_.size();
kaldi::nnet3::Cindex cindex(0, kaldi::nnet3::Index(3, 0, 0));
std::pair<map_type::iterator, bool> p = cindex_to_cindex_id_.insert(std::pair<kaldi::nnet3::Cindex, int32>(cindex, new_index));
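A self-contained variant of the snippet above, for readers without a Kaldi checkout. `Index`, `Cindex` and `CindexHasher` here are simplified stand-ins written for this illustration (the real ones live in Kaldi's nnet3 headers and differ in detail), and `std::unordered_map` is used rather than the `std::tr1` one seen in the backtrace. It only illustrates that a correctly constructed map has a nonzero bucket count once an element has been inserted, which is consistent with the isolated snippet not crashing.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <utility>

// Simplified stand-in for kaldi::nnet3::Index (assumption: three int fields).
struct Index {
  int n, t, x;
  bool operator==(const Index &o) const {
    return n == o.n && t == o.t && x == o.x;
  }
};
typedef std::pair<int, Index> Cindex;

// Stand-in hasher; the mixing constants are arbitrary, not Kaldi's.
struct CindexHasher {
  std::size_t operator()(const Cindex &c) const {
    std::size_t h = static_cast<std::size_t>(c.first);
    h = h * 31 + static_cast<std::size_t>(c.second.n);
    h = h * 31 + static_cast<std::size_t>(c.second.t);
    h = h * 31 + static_cast<std::size_t>(c.second.x);
    return h;
  }
};

// Performs the same single insert as the snippet in the thread and returns
// the bucket count of the map afterwards.
inline std::size_t InsertOneAndCountBuckets() {
  typedef std::unordered_map<Cindex, int, CindexHasher> map_type;
  map_type cindex_to_cindex_id;
  int new_index = static_cast<int>(cindex_to_cindex_id.size());
  Index index = {3, 0, 0};
  Cindex cindex(0, index);
  std::pair<map_type::iterator, bool> p =
      cindex_to_cindex_id.insert(std::make_pair(cindex, new_index));
  assert(p.second);  // the key was not present before
  return cindex_to_cindex_id.bucket_count();
}
```

Because the standalone insert behaves correctly, the failing case points at the state of the map inside the `ComputationGraph` object rather than at the insert logic itself.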
|
For the Ubuntu problem, I noticed that you are spelling out the command you
compiled with...
If you didn't compile using the standard Makefile, there is a possibility
that you didn't supply exactly the same flags, and this could change the
size of objects. I notice you're not specifying the value of
KALDI_DOUBLEPRECISION. I just had a look at how the header interrogates
this value, and it does:
#if (KALDI_DOUBLEPRECISION != 0)
typedef double BaseFloat;
#else
typedef float BaseFloat;
#endif
so if you don't define it, it's possible that it would use double for
BaseFloat, changing the sizes of objects.
This may be the reason for the problem.
It's unrelated to the Windows issue. I suspect the Windows issue is caused
either by problems in BLAS libraries, or by errors in the instrumentation.
|
Hm, it turns out that if the macro is undefined it defaults to float (an
undefined identifier evaluates to 0 inside #if), so unless you had
configured for double precision this shouldn't be the problem. But it
could *possibly* be some other flag.
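A small check of the preprocessor rule behind this correction: in an `#if` expression, an identifier that is not defined as a macro is replaced by 0, so the header's test selects `float`. The sketch below reproduces only the typedef logic quoted earlier in the thread, under the assumption that nothing else in the build defines `KALDI_DOUBLEPRECISION`; the helper `BaseFloatSize` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Make sure the macro is undefined for this demonstration.
#undef KALDI_DOUBLEPRECISION

// Same conditional as quoted from the Kaldi header in the thread.
#if (KALDI_DOUBLEPRECISION != 0)
typedef double BaseFloat;
#else
typedef float BaseFloat;
#endif

// Compile-time confirmation that the undefined macro selected float.
static_assert(std::is_same<BaseFloat, float>::value,
              "an undefined macro is treated as 0 in #if, selecting float");

// Hypothetical helper so the result can also be inspected at run time.
inline std::size_t BaseFloatSize() { return sizeof(BaseFloat); }
```

Compiling with `-DKALDI_DOUBLEPRECISION=1` and removing the `#undef` would flip the typedef to `double` and trip the `static_assert`, which is the kind of size change the earlier message was worried about.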
|
I don't think it is a flag issue. Kaldi is compiled with the default makefile, and the same compilation process on CentOS 6.6 does not give the error. It occurs on Ubuntu and Cygwin. Do all Linux distributions use the same standard library? It might be that Ubuntu's
|
I tried the code without the lambda and allocated all the objects with new, but the valgrind output is still the same.
|
Hm. Try with the latest Kaldi code; and try without using a separate
thread, that might narrow it down.
|
BTW, the fact that your program is called a.out implies that your *test
code* was not compiled with the standard makefile. Your test code should
use exactly the same flags and gcc command line as Kaldi (e.g. add it to
BINFILES in a binary directory).
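Why mismatched flags can produce a crash like the one in this thread, as a hedged sketch: if a header changes a class's members depending on a macro or language mode, the library and the test program can disagree about where each field lives, and one side may then read a zero where the other stored a bucket count. The thread does not establish which flag differed, so the two structs below are purely hypothetical stand-ins for "the same class as seen by two differently compiled translation units".

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical view of a class from code built with a flag that adds a member.
struct GraphAsSeenByLibrary {
  std::int64_t extra_state;   // present only under the library's flags
  std::size_t bucket_count;
};

// Hypothetical view of the "same" class from code built without that flag.
struct GraphAsSeenByTestCode {
  std::size_t bucket_count;
};

// Returns true only if both views agree on size and on the field's offset.
inline bool LayoutsAgree() {
  return sizeof(GraphAsSeenByLibrary) == sizeof(GraphAsSeenByTestCode) &&
         offsetof(GraphAsSeenByLibrary, bucket_count) ==
             offsetof(GraphAsSeenByTestCode, bucket_count);
}
```

Here `LayoutsAgree()` returns false, which is the situation to avoid. Building the test program through the same makefile as the library gives both sides identical flags, so they see the same layout by construction.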
|
Thanks for the tip. Adding the code to the makefile solves the issue. I will try to compile with Cygwin, and the error should be gone there, too. This way I can check whether the leak is caused by the Windows OS or by Microsoft's STL.
|