
Errors related to malloc and free #5728

Closed
GaiYu0 opened this issue Apr 7, 2017 · 29 comments

@GaiYu0

GaiYu0 commented Apr 7, 2017

Hi! I encounter these errors when training a network:

*** Error in `/usr/bin/python': malloc(): memory corruption (fast): 0x0000000001755880 ***

*** Error in `/usr/bin/python': free(): invalid pointer: 0x000000000171ec30 ***

I am using the latest version of MXNet from the engine branch. Similar errors occur when I use MXNet from the master branch.

Could anyone help? Thank you very much!

Unfortunately I cannot get a Python stack trace, but a C stack trace is available:

#0 0x00007ffff782dc37 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x00007ffff7831028 in __GI_abort () at abort.c:89
#2 0x00007ffff786a2a4 in __libc_message (do_abort=do_abort@entry=1, fmt=fmt@entry=0x7ffff79786b0 "*** Error in `%s': %s: 0x%s ***\n")
at ../sysdeps/posix/libc_fatal.c:175
#3 0x00007ffff7874ff7 in malloc_printerr (action=<optimized out>, str=0x7ffff7978a50 "malloc(): memory corruption (fast)",
ptr=<optimized out>) at malloc.c:4996
#4 0x00007ffff7877cf4 in _int_malloc (av=0x7fff00000020, bytes=24) at malloc.c:3359
#5 0x00007ffff78796c0 in __GI___libc_malloc (bytes=24) at malloc.c:2891
#6 0x00007fffddad6dad in operator new(unsigned long) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x00007fffebcab89d in std::_Function_base::_Base_manager<mxnet::NDArray::Chunk::~Chunk()::{lambda(mxnet::RunContext)#2}>::_M_manager(std::_Any_data&, std::_Function_base::_Base_manager<mxnet::NDArray::Chunk::~Chunk()::{lambda(mxnet::RunContext)#2}> const&, std::_Manager_operation) () from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#8 0x00007fffebcf715f in std::_Function_base::_Base_manager<mxnet::engine::ThreadedEngine::DeleteVariable(std::function<void (mxnet::RunContext)>, mxnet::Context, mxnet::engine::Var*)::{lambda(mxnet::RunContext)#1}>::_M_manager(std::_Any_data&, std::_Function_base::_Base_manager<mxnet::engine::ThreadedEngine::DeleteVariable(std::function<void (mxnet::RunContext)>, mxnet::Context, mxnet::engine::Var*)::{lambda(mxnet::RunContext)#1}> const&, std::_Manager_operation) ()
from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#9 0x00007fffebcaca74 in std::function<void (mxnet::RunContext)>::function(std::function<void (mxnet::RunContext)> const&) ()
from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#10 0x00007fffebcf73b1 in mxnet::engine::ThreadedEngine::DeleteVariable(std::function<void (mxnet::RunContext)>, mxnet::Context, mxnet::engine::Var*) () from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#11 0x00007fffebcaba6d in std::_Sp_counted_ptr_inplace<mxnet::NDArray::Chunk, std::allocator<mxnet::NDArray::Chunk>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#12 0x00007fffebcad78e in std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> >::~vector() ()
from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#13 0x00007fffeb4c8eca in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release() ()
from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#14 0x00007fffebd114b8 in std::_Function_base::_Base_manager<mxnet::exec::GraphExecutor::InitCachedOps()::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#3}>::_M_manager(std::_Any_data&, std::_Function_base::_Base_manager<mxnet::exec::GraphExecutor::InitCachedOps()::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#3}> const&, std::_Manager_operation) ()
from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#15 0x00007fffebcf82d6 in std::_Function_handler<void (mxnet::RunContext), mxnet::engine::ThreadedEngine::DeleteOperator(mxnet::engine::Opr*)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext) ()
from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#16 0x00007fffebcab693 in operator() (__args#0=..., this=<optimized out>) at /usr/include/c++/4.8/functional:2471
#17 operator() (on_complete=..., ctx=..., __closure=<optimized out>) at include/mxnet/././engine.h:213
#18 std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::Engine::PushSync(std::function<void (mxnet::RunContext)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext, mxnet::engine::CallbackOnComplete) (__functor=...,
__args#0=..., __args#1=...) at /usr/include/c++/4.8/functional:2071
#19 0x00007fffebcfe06c in mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) ()
from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#20 0x00007fffebd0097e in std::_Function_handler<void (), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
from /home/gaiyu/developping/mxnet_engine/python/mxnet/../../lib/libmxnet.so
#21 0x00007fffddb29a60 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#22 0x00007ffff7bc4184 in start_thread (arg=0x7fff433fd700) at pthread_create.c:312
#23 0x00007ffff78f137d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

@piiswrong
Contributor

Please post a reproducible example using the master branch.

@sifmelcara
Contributor

I can reproduce a free() error similar to this issue by compiling and running mxnet/cpp-package/example/lenet.cpp with CUDA and cuDNN support.

*** Error in `./a.out': free(): invalid pointer: 0x0000000002cc84e0 ***

This error only occurs when training with a GPU, and it appears randomly, sometimes after 3 iterations, sometimes after 20.

@GaiYu0
Author

GaiYu0 commented Apr 9, 2017

Similar to @sifmelcara, the error I encounter also appears randomly.

@sifmelcara
Contributor

I can also reproduce this on the mxnet 0.9.3a release (which is about two months old),
so I think this problem may be related to my CUDA version, cuDNN version, or hardware...
For your information, I use
gcc 5.4.0
openblas 0.2.19
opencv 3.2.0
cudatoolkit 8.0.61
cuDNN v5.1 (Jan 20, 2017), for CUDA 8.0
Nvidia 378.13 driver (linux)
GTX 1080 Ti

@sifmelcara
Contributor

To find out whether this problem is related to my hardware, I took my drive to my friend's computer, which has a GTX 1070, booted the same OS, and ran the same binary (lenet); the program ran fine and did not crash.
So I would guess this issue is related to the 1080 Ti. @GaiYu0, are you also using a 1080 Ti?

PS: The strange thing is that running TensorFlow with CUDA and cuDNN on my computer does not crash... so there might still be an issue with mxnet.

@GaiYu0
Author

GaiYu0 commented Apr 10, 2017

My program crashed on 2 machines. One uses GTX TITAN Black and the other uses GTX TITAN X.

@sifmelcara
Contributor

I found that this issue is related to int MXExecutorFree(ExecutorHandle handle). A temporary workaround for me is to delay every MXExecutorFree() call by 200 ms.

I wonder if it is because, in GraphExecutor::~GraphExecutor(), the code forgets to ensure the operators have finished computing before calling Engine::Get()->DeleteOperator().
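
For reference, here is a minimal sketch of such a user-side workaround. SafeExecutorFree is a hypothetical wrapper, not part of the MXNet API; MXNDArrayWaitAll(), which blocks until all pending engine operations finish, is shown as a less fragile alternative to a fixed sleep.

// Hypothetical user-side wrapper around MXExecutorFree() (not part of the
// MXNet API). It drains all pending asynchronous engine work via
// MXNDArrayWaitAll() before freeing the executor; the fixed 200 ms sleep
// described above is kept as a commented-out fallback.
#include <chrono>
#include <thread>
#include <mxnet/c_api.h>

int SafeExecutorFree(ExecutorHandle handle) {
  // std::this_thread::sleep_for(std::chrono::milliseconds(200));  // crude delay workaround
  MXNDArrayWaitAll();  // wait until every pending engine operation has completed
  return MXExecutorFree(handle);
}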

@eric-haibin-lin
Member

@piiswrong we definitely need tests for cpp-package to run in CI

@GaiYu0
Author

GaiYu0 commented Apr 13, 2017

The error I encounter has now changed to:

[screenshot: error message]

[screenshot: error message]

I compiled MXNet with the DEBUG flag, ran my program, and obtained more comprehensible stack traces:

[screenshot: stack trace]

[screenshot: stack trace]

@eric-haibin-lin
Member

Does setting the env vars

MXNET_EXEC_BULK_EXEC_INFERENCE=0
MXNET_EXEC_BULK_EXEC_TRAIN=0

make any difference in your case?

@GaiYu0
Author

GaiYu0 commented Apr 13, 2017

@eric-haibin-lin These variables are exclusive and I should set them before running MXNet, right?

@eric-haibin-lin
Member

They're read at runtime. Simply prepend them to the command you run:
MXNET_EXEC_BULK_EXEC_INFERENCE=0 MXNET_EXEC_BULK_EXEC_TRAIN=0 path/to/executable

I'm just curious if this is caused by a recent change in executor (bulk execution).

@GaiYu0
Author

GaiYu0 commented Apr 13, 2017

Another error resulted from training the same network:
[screenshot: error message]
Stack trace:
[screenshot: stack trace]

@sifmelcara
Contributor

@eric-haibin-lin Looks like turning off bulk execution makes no difference for me.

I would also like to provide my stack trace (produced by cpp-package/example/lenet.cpp).

(gdb) bt
#0  0x00007fffe5cea81b in malloc_consolidate () from /nix/store/izxnyg94352qxa4a4783dzgnpy5cwazj-glibc-2.25/lib/libc.so.6
#1  0x00007fffe5ceb400 in _int_free () from /nix/store/izxnyg94352qxa4a4783dzgnpy5cwazj-glibc-2.25/lib/libc.so.6
#2  0x00007fffe7bcee60 in __gnu_cxx::new_allocator<mxnet::TBlob>::deallocate (this=0x7ffed000dee8, __p=<optimized out>)
    at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/ext/new_allocator.h:110
#3  std::allocator_traits<std::allocator<mxnet::TBlob> >::deallocate (__a=..., __n=<optimized out>, __p=<optimized out>)
    at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/bits/alloc_traits.h:517
#4  std::_Vector_base<mxnet::TBlob, std::allocator<mxnet::TBlob> >::_M_deallocate (this=0x7ffed000dee8, __n=<optimized out>, __p=<optimized out>)
    at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/bits/stl_vector.h:178
#5  std::_Vector_base<mxnet::TBlob, std::allocator<mxnet::TBlob> >::~_Vector_base (this=0x7ffed000dee8, __in_chrg=<optimized out>)
    at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/bits/stl_vector.h:160
#6  std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> >::~vector (this=0x7ffed000dee8, __in_chrg=<optimized out>)
    at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/bits/stl_vector.h:425
#7  mxnet::exec::ForwardOpExecutor::~ForwardOpExecutor (this=0x7ffed000de30, __in_chrg=<optimized out>) at src/executor/attach_op_execs_pass.cc:25
#8  0x0000000000405f56 in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_release (this=0x7ffed000de20)
    at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/bits/shared_ptr_base.h:150
#9  0x00007fffe7baa588 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::~__shared_count (this=0x7ffed0014e58, __in_chrg=<optimized out>)
    at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/bits/shared_ptr_base.h:659
#10 std::__shared_ptr<mxnet::exec::OpExecutor, (__gnu_cxx::_Lock_policy)2>::~__shared_ptr (this=0x7ffed0014e50, __in_chrg=<optimized out>)
    at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/bits/shared_ptr_base.h:925
#11 std::shared_ptr<mxnet::exec::OpExecutor>::~shared_ptr (this=0x7ffed0014e50, __in_chrg=<optimized out>)
    at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/bits/shared_ptr.h:93
#12 mxnet::exec::GraphExecutor::<lambda(mxnet::RunContext, mxnet::Engine::CallbackOnComplete)>::~<lambda> (this=0x7ffed0014e50, __in_chrg=<optimized out>)
    at src/executor/graph_executor.cc:662
#13 std::_Function_base::_Base_manager<mxnet::exec::GraphExecutor::InitCachedOps()::<lambda(mxnet::RunContext, mxnet::Engine::CallbackOnComplete)> >::_M_destroy (__victim=...) at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/functional:1726
#14 std::_Function_base::_Base_manager<mxnet::exec::GraphExecutor::InitCachedOps()::<lambda(mxnet::RunContext, mxnet::Engine::CallbackOnComplete)> >::_M_manager(std::_Any_data &, const std::_Any_data &, std::_Manager_operation) (__dest=..., __source=..., __op=<optimized out>)
    at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/functional:1750
#15 0x00007fffe7b413ce in std::_Function_base::~_Function_base (this=0x1599dd0, __in_chrg=<optimized out>)
    at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/functional:1830
#16 std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>::~function() (this=0x1599dd0, __in_chrg=<optimized out>)
    at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/functional:1974
#17 mxnet::engine::ThreadedOpr::~ThreadedOpr (this=0x1599dd0, __in_chrg=<optimized out>) at src/engine/./threaded_engine.h:200
#18 mxnet::common::ObjectPool<mxnet::engine::ThreadedOpr>::Delete (this=0xa23b20, ptr=0x1599dd0) at src/engine/./../common/object_pool.h:139
#19 0x00007fffe77b425d in std::function<void (mxnet::RunContext)>::operator()(mxnet::RunContext) const (__args#0=..., this=<optimized out>)
    at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/functional:2267
#20 mxnet::Engine::PushSync(std::function<void (mxnet::RunContext)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const (on_complete=..., ctx=..., __closure=<optimized out>)
---Type <return> to continue, or q <return> to quit---
    at include/mxnet/././engine.h:211
#21 std::_Function_handler<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete), mxnet::Engine::PushSync(std::function<void (mxnet::RunContext)>, mxnet::Context, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, mxnet::FnProperty, int, char const*)::{lambda(mxnet::RunContext, mxnet::engine::CallbackOnComplete)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext&&, mxnet::engine::CallbackOnComplete&&) (__functor=..., __args#0=<optimized out>, __args#1=<optimized out>)
    at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/functional:1871
#22 0x00007fffe7b47e37 in std::function<void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>::operator()(mxnet::RunContext, mxnet::engine::CallbackOnComplete) const (__args#1=..., __args#0=..., this=0x15908f0) at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/functional:2267
#23 mxnet::engine::ThreadedEngine::ExecuteOprBlock (this=<optimized out>, run_ctx=..., opr_block=0xa17de8, this=<optimized out>)
    at src/engine/./threaded_engine.h:321
#24 0x00007fffe7b4e4f6 in mxnet::engine::ThreadedEnginePerDevice::CPUWorker<(dmlc::ConcurrentQueueType)0> (block=0xa14b70, this=0xa106f0)
    at src/engine/threaded_engine_perdevice.cc:180
#25 mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const
    (__closure=<optimized out>) at src/engine/threaded_engine_perdevice.cc:76
#26 std::_Function_handler<void (), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#1}::operator()() const::{lambda()#1}>::_M_invoke(std::_Any_data const&) (__functor=...) at /nix/store/kwyac2lqc22xy6m2apprdw4lzpri08cj-gcc-5.4.0/include/c++/5.4.0/functional:1871
#27 0x00007fffe65efd00 in ?? () from /nix/store/zag7bpja0fxm2r45x5xzdv8ff3rvj2nx-gcc-5.4.0-lib/lib/libstdc++.so.6
#28 0x00007fffcaf97234 in start_thread () from /nix/store/izxnyg94352qxa4a4783dzgnpy5cwazj-glibc-2.25/lib/libpthread.so.0
#29 0x00007fffe5d5b75f in clone () from /nix/store/izxnyg94352qxa4a4783dzgnpy5cwazj-glibc-2.25/lib/libc.so.6

@sifmelcara
Contributor

I can confirm that by setting the MXNET_ENGINE_TYPE=NaiveEngine env var, the error free(): invalid pointer never occurs again.
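
For anyone who prefers to force this from code rather than the shell, a minimal sketch, assuming a POSIX system (the setenv call must run before MXNet makes any engine call):

// Select the naive (synchronous) engine programmatically, equivalent to
// running the binary with MXNET_ENGINE_TYPE=NaiveEngine in the shell.
#include <cstdlib>

int main() {
    setenv("MXNET_ENGINE_TYPE", "NaiveEngine", /*overwrite=*/1);  // before any MXNet call
    // ... build and train the network as in cpp-package/example/lenet.cpp ...
    return 0;
}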

@eric-haibin-lin
Member

I'll investigate over the weekend whether there's any issue with the threaded engine in cpp-package.

@eric-haibin-lin
Member

@GaiYu0 @sifmelcara Are the stack traces from the latest mxnet version? Did you try the latest version?

@GaiYu0
Author

GaiYu0 commented Apr 14, 2017

I am using the latest mxnet from the engine branch.

@sifmelcara
Contributor

sifmelcara commented Apr 15, 2017

The stack trace is from the master branch. However, I also tested the v0.9.3 stable release and it has the same issue.

@eric-haibin-lin
Member

@sifmelcara were you using the MNIST dataset as the input for the lenet example? The example code expects to read data from a 'train.csv'; what did you use as input?

@sifmelcara
Contributor

I downloaded the training set from https://pjreddie.com/projects/mnist-in-csv/ and renamed it to train.csv.

@eric-haibin-lin
Member

@sifmelcara I ran the lenet example for 5089 iters and could not reproduce this bug. I am running commit 96eb4f5 from this PR: #5844

@sifmelcara
Contributor

sifmelcara commented Apr 16, 2017

I booted the exact same hard drive on two computers, one with a 1080 Ti, the other with a 1070.
The bug only appears when running on the 1080 Ti. And since

  1. @GaiYu0 has a TITAN X, which also runs very fast,
  2. I can avoid this bug by deferring the execution of the Executor's destructor, and
  3. I can avoid this bug by not using the threaded engine,

I guess we need a fast GPU to reproduce this bug.

@eric-haibin-lin
Member

eric-haibin-lin commented Apr 16, 2017

@sifmelcara well, I am able to reproduce @GaiYu0's issue with the same hardware on a previous version of MXNet, but it seems to be fixed with 96eb4f5.

@sifmelcara
Contributor

I just tested it again; here are the steps I take to reproduce the bug.

  1. git clone https://github.com/eric-haibin-lin/mxnet.git --recursive
  2. cd mxnet && git checkout cpp && git submodule update --recursive
    (now at 2e66d77)
  3. cp make/config.mk .
  4. set
USE_CUDA = 1
USE_CUDNN = 1
USE_CPP_PACKAGE = 1
USE_BLAS=openblas
ADD_CFLAGS += -I.... (open blas include dir)
ADD_LDFLAGS += -lopencv_core -lopencv_imgproc -lopencv_imgcodecs
USE_CUDA_PATH=... (cuda tool kit 8.0.61 path)
  5. add -L/location/to/libcuda.so before -lcuda in Makefile
  6. make -j4
  7. cp cpp-package/example/lenet.cpp . and compile it:
g++ --std=c++11 -Iinclude \
                -Innvm/include \
                -Idmlc-core/include \
                -Icpp-package/include \
    -msse3 -Llib -lmxnet lenet.cpp
  8. export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$(pwd)/lib
  9. get train.csv
  10. run ./a.out
  11. get Segmentation fault (here is the output: https://pastebin.com/2yWkEhGB )

@sifmelcara
Contributor

@eric-haibin-lin I found a way to consistently reproduce the issue on my machine.

  1. Run ./a.out (lenet example)
  2. Wait until training begins.
  3. Run stress -i 1000 (this stress tool is available in most distributions)
  4. The lenet example crashes and prints free(): invalid pointer. Sometimes it does not crash immediately; just wait a few seconds and then kill stress, and lenet will crash after stress is killed.

I really appreciate your help in this issue. Thank you.

@sifmelcara
Contributor

Since @eric-haibin-lin reported that @GaiYu0's issue has probably been fixed by 96eb4f5, I guess my issue is somewhat different from this one.
After several days of code tracing, I still cannot solve my issue, so I opened another issue: #6039.

@eric-haibin-lin
Member

Looks like both issues are fixed. Closing it for now.

@maylad31

maylad31 commented Jan 3, 2019

Hi, can anyone please tell me how to delay the MXExecutorFree() call?
