Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client may crash during shutdown when ShutdownTask is being run #430

Merged
merged 3 commits into from
Jun 26, 2018

Conversation

ihsandemir
Copy link
Collaborator

@ihsandemir ihsandemir commented Jun 18, 2018

Fix for issues #428 and #429. Do not access any object after runnable is finished. An e.g. scenario: ShutdownTask is started, client is shutdown and client destructor is being executed which causes the ClientConnectionManagerImpl to be destructed and the shared_ptr in connectionManager.shutdownThreads is destructed and since there is no other reference to the thread, the thread object is also destructed while the thread did not finish the runnableThread method.

Do not access any client object after ShutdownTask runnable is finished. Obtain the client name at the constructor of ShutdownTask to eliminate the illegal memory access when the client is shutdown e.g. during heartbeat failure and the client object is destructed already (the destructor waits on a latch which is released once the ShutdownTask::run is finished. Hence the ShutdownTask thread should not access the possibly destructed client on thread destructor where the name of the thread is written.

fixes #428
fixes #429

…ect after runnable is finished. An e.g. scenario: ShutdownTask is started, client is shutdown and client destructor is being executed which causes the ClientConnectionManagerImpl to be destructed and the shared_ptr in connectionManager.shutdownThreads is destructed and since there is no other reference to the thread, the thread object is also destructed while the thread did not finish the runnableThread method. This fix simply do not delete the heap allocated shutdown thread. We assume that there will not be too many of such threads, and hence the leak can be ignored.
@devOpsHazelcast
Copy link
Contributor

Windows test PASSed.

@devOpsHazelcast
Copy link
Contributor

Linux test PASSed.

@ihsandemir ihsandemir modified the milestones: 3.10, 3.10.1 Jun 18, 2018
ihsandemir and others added 2 commits June 25, 2018 23:56
…e the illegal memory access when the client is shutdown e.g. during heartbeat failure and the client object is destructed already (the destructor waits on a latch which is released once the ShutdownTask::run is finished. Hence the ShutdownTask thread should not access the possibly destructed client on thread destructor where the name of the thread is written.

fixes hazelcast#428
fixes hazelcast#429

Cleaned up unused header file and corrected the exception print at the ClientPartitionServiceImpl.
@devOpsHazelcast
Copy link
Contributor

Windows test PASSed.

@devOpsHazelcast
Copy link
Contributor

Linux test PASSed.

1 similar comment
@devOpsHazelcast
Copy link
Contributor

Linux test PASSed.

@devOpsHazelcast
Copy link
Contributor

Windows test PASSed.

@ihsandemir ihsandemir merged commit b67f2fb into hazelcast:master Jun 26, 2018
@ihsandemir ihsandemir deleted the shutdownCrashFix branch June 26, 2018 12:28
@siwhiting
Copy link

Have built against latest hazelcast::master as of 2nd July, and re-run test for #428 & #429.

Now seeing SEGV in non-debug version of hazelcast library, or abort debug of hazelcast library (see below)

Core was generated by `sif /opt/engine/conf/sif.cfg'.
Program terminated with signal 6, Aborted.
#0 0x00dca430 in __kernel_vsyscall ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.209.el6_9.2.i686 keyutils-libs-1.4-5.el6.i686 krb5-libs-1.10.3-42.el6.i686 libcom_err-1.41.12-22.el6.i686 libgcc-4.4.7-16.el6.i686 libpcap-0.9.4-15.el5.i386 libselinux-2.0.94-5.8.el6.i686 libstdc++-4.4.7-16.el6.i686 openssl098e-0.9.8e-20.el6.centos.1.i686 zlib-1.2.3-29.el6.i686
(gdb) bt
#0 0x00dca430 in __kernel_vsyscall ()
#1 0x007cb781 in raise () from /lib/libc.so.6
#2 0x007cd05a in abort () from /lib/libc.so.6
#3 0x007c4a4b in __assert_fail_base () from /lib/libc.so.6
#4 0x007c4b06 in __assert_fail () from /lib/libc.so.6
#5 0x012cbf62 in hazelcast::util::Mutex::lock (this=0x990de94)
at /home/swhiting/dev/third-party/hazelcast-cpp-client-master-jul-2nd/hazelcast/src/hazelcast/util/Mutex.cpp:77
#6 0x012cb98d in hazelcast::util::LockGuard::LockGuard (this=0xe6fece70, mutex=...)
at /home/swhiting/dev/third-party/hazelcast-cpp-client-master-jul-2nd/hazelcast/src/hazelcast/util/LockGuard.cpp:26
#7 0x012cbcb8 in hazelcast::util::AtomicBoolean::compareAndSet (this=0x990de94, compareValue=true, setValue=false)
at /home/swhiting/dev/third-party/hazelcast-cpp-client-master-jul-2nd/hazelcast/src/hazelcast/util/AtomicBoolean.cpp:44
#8 0x011cceba in hazelcast::client::spi::LifecycleService::shutdown (this=0x990de5c)
at /home/swhiting/dev/third-party/hazelcast-cpp-client-master-jul-2nd/hazelcast/src/hazelcast/client/spi/LifecycleService.cpp:73
#9 0x01257112 in hazelcast::client::connection::ClientConnectionManagerImpl::ShutdownTask::run (this=0xf26007d8)
at /home/swhiting/dev/third-party/hazelcast-cpp-client-master-jul-2nd/hazelcast/src/hazelcast/client/connection/ClientConnectionManagerImpl.cpp:801
#10 0x0818356e in hazelcast::util::Thread::runnableThread (args=0xf26007d8)
at ../target/deps/hazelcast-3.10.1-ei/include/hazelcast/util/PosixThread.inl:70
#11 0x00b2dbc9 in start_thread () from /lib/libpthread.so.0
#12 0x0088404e in clone () from /lib/libc.so.6
(gdb) frame 5
#5 0x012cbf62 in hazelcast::util::Mutex::lock (this=0x990de94)
at /home/swhiting/dev/third-party/hazelcast-cpp-client-master-jul-2nd/hazelcast/src/hazelcast/util/Mutex.cpp:77
77 /home/swhiting/dev/third-party/hazelcast-cpp-client-master-jul-2nd/hazelcast/src/hazelcast/util/Mutex.cpp: No such file or directory.
in /home/swhiting/dev/third-party/hazelcast-cpp-client-master-jul-2nd/hazelcast/src/hazelcast/util/Mutex.cpp
(gdb) print mutex
$1 = {__data = {__lock = 0, __count = 0, __owner = 0, __kind = -1, __nusers = 0, {__spins = 0, __list = {__next = 0x0}}},
__size = '\000' <repeats 12 times>"\377, \377\377\377\000\000\000\000\000\000\000", __align = 0}
(gdb) print error
$2 = 22

@ihsandemir
Copy link
Collaborator Author

@siwhiting Thanks for the feedback, seems like we could not solve the problem yet. I commented here: #429 (comment) and reopened the issue. We need a better approach to not let the client instance impl. destruct until the shutdown thread completes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants