Thread hangs on exit #50

tsilen · 2012-09-25T19:29:26Z

Recently we've run into an issue where a process using zk and redis_failover sometimes fails to die on exit, but instead hangs in an infinite sched_yield() loop taking all CPU. A SIGKILL is required to get rid of it. I guess it has to do with the program exiting without properly closing the connection first, but i guess it should still die cleanly.

gdb output gives this:

#0  0x00007fb7e6765a67 in sched_yield () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fb7e6bd86f7 in gvl_yield (vm=0x1bfdf20, th=<optimized out>) at thread_pthread.c:125
#2  0x00007fb7e6bdac16 in rb_thread_schedule_limits (limits_us=0) at thread.c:1025
#3  rb_thread_schedule_limits (limits_us=0) at thread.c:1033
#4  rb_thread_schedule () at thread.c:1035
#5  0x00007fb7e6bdad5f in rb_thread_terminate_all () at thread.c:375
#6  0x00007fb7e6abf89e in ruby_cleanup (ex=0) at eval.c:140
#7  0x00007fb7e6abfa25 in ruby_run_node (n=0x24f0428) at eval.c:244
#8  0x00000000004007fb in main (argc=3, argv=0x7fff7725e948) at main.c:38

After adding some debug-code to ruby side to get a backtrace when the process is hung, I was able to get this:

Thread TID-t26i0
ruby-1.9.3-p194/lib/ruby/1.9.1/thread.rb:71:in `wait'
shared/bundle/ruby/1.9.1/gems/zk-1.7.1/lib/zk/threadpool.rb:268:in `worker_thread_body'

When trying to reproduce it without redis_failover i was able to get it hang in a similar way, but in a different place:

Thread TID-ccaag
ruby-1.9.3-p194/lib/ruby/1.9.1/thread.rb:71:in `wait'
shared/bundle/ruby/1.9.1/gems/zookeeper-1.3.0/lib/zookeeper/common/queue_with_pipe.rb:59:in `pop'
shared/bundle/ruby/1.9.1/gems/zookeeper-1.3.0/lib/zookeeper/common.rb:56:in `get_next_event'
shared/bundle/ruby/1.9.1/gems/zookeeper-1.3.0/lib/zookeeper/common.rb:94:in `dispatch_thread_body'

and

Thread TID-alg44
rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/thread.rb:71:in `wait'
rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/monitor.rb:110:in `wait'
rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/monitor.rb:122:in `wait_while'
shared/bundle/ruby/1.9.1/gems/zk-1.7.1/lib/zk/client/threaded.rb:533:in `block in reconnect_thread_body'
rubies/ruby-1.9.3-p194/lib/ruby/1.9.1/monitor.rb:211:in `mon_synchronize'

Code in all is somewhat similar (@cond.wait).

Any ideas? Ruby is 1.9.3-p194, but it also happens at least on 1.9.3-p0. ZK 1.7.1, zookeeper 1.3.0. OS: linux ubuntu 12.04.

At least in ruby 1.9.3 you can give a timeout to ConditionVariable's wait(), maybe that would help.

With this I was able to get it hang in a similar way quite often:

#!/usr/bin/env ruby
require 'rubygems'
require 'zk'
$stdout.sync = true
@zkservers = "localhost:2181"

trap 'TTIN' do
  Thread.list.each do |thread|
    puts "Thread TID-#{thread.object_id.to_s(36)}"
    puts thread.backtrace.join("\n")
  end
end

def do_something
  zk = ZK.new(@zkservers)
  puts zk.children('/').inspect 
  sleep 1
end

puts "Pid: #{$$}"
count = 50
stack = []
(0..count).each do |i|
  stack << Thread.new { do_something }
end
sleep rand(0)

Running it in while true; do ./test.rb; done loop until it gets stuck and then kill -TTIN prints the backtraces of the still alive threads.

The text was updated successfully, but these errors were encountered:

tsilen · 2012-09-26T13:07:39Z

One workaround would be calling close! in at_exit { }

slyphon · 2012-09-26T16:43:07Z

Just wanted to let you know I got this. I'm not working today, but I'll try
to get a fix together by the EOW.

Jonathan Simms
(sent from my mobile device)

On Sep 26, 2012, at 9:07, tsilen notifications@github.com wrote:

One workaround would be calling close! in at_exit { }

—
Reply to this email directly or view it on
GitHubhttps://github.com//issues/50#issuecomment-8889029.

ryanlecompte · 2012-10-08T15:12:17Z

Any update on this one?

maxjustus · 2012-10-08T15:19:42Z

+1

slyphon · 2012-10-08T15:35:20Z

hey guys, sorry, been bogged down with holidays and nonsense, i'll try to get to this one this week

slyphon · 2012-10-08T15:50:29Z

After a quick look, it seems the problem is (as always) 1.8.7 compatibility (yes, i know, talk to @eric), where only Monitor's condvar takes a timeout. I'll have to review it a little more closely. I like the at_exit { zk.close! } idea, i'm just not sure if that will cause a memory leak, as it will keep a reference to the zk instance. It shouldn't matter in that most people are going to only use one or two zk instances, but still.

I'll try to review the places where wait() is being called and at least get a branch together that people can test against.

ryanlecompte · 2012-10-08T16:14:06Z

Thanks for taking a look! With regards to most people not having many ZK instances - that might not always be true if you're using redis_failover with a connection pool (for example, what's used in Sidekiq). Sidekiq lets you create a connection pool with e.g. 10 or 20 redis clients. Each RedisFailover::Client instance will contain an underlying ZK instance.

ryanlecompte · 2012-10-08T19:37:52Z

By the way, I just updated (in master) RedisFailover::Client so that you can pass in an existing ZK client instance. You would use it like this:

  zk = ZK.new('localhost:2181,localhost:2182,localhost:2183,localhost:2184,localhost:2185')
  pool = ConnectionPool.new(:size => 20) { RedisFailover::Client.new(:zk => zk) }
  pool.with { |client| client.get('foo') }

slyphon · 2012-10-08T20:34:17Z

this is probably how it should be done, just in terms of resource usage
(open file handles, etc).

On Mon, Oct 8, 2012 at 3:37 PM, Ryan LeCompte notifications@git.luolix.topwrote:

By the way, I just updated (in master) RedisFailover::Client so that you
can pass in an existing ZK client instance. You would use it like this:

zk = ZK.new('localhost:2181,localhost:2182,localhost:2183,localhost:2184,localhost:2185')
pool = ConnectionPool.new(:size => 20) { RedisFailover::Client.new(:zk => zk) }
pool.with { |client| client.get('foo') }

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/50#issuecomment-9237528.

slyphon · 2012-10-08T22:07:33Z

Ok, so I think the test @tsilen came up with shows a bug in ruby. I've run into issues before with at-exit GC and thread primitives (in 1.8, but still). If you modify the above torture test:

#!/usr/bin/env ruby
require 'rubygems'
require 'zk'
require 'logger'

$stdout.sync = true
$log = Logger.new($stderr).tap { |n| n.level = Logger::DEBUG }

@zkservers = "localhost:2181"

trap 'TTIN' do
  Thread.list.each do |thread|
    puts "Thread TID-#{thread.object_id.to_s(36)}"
    puts thread.backtrace.join("\n")
  end
end

def do_something
  ZK.open(@zkservers) do |zk|
    thr_id = Thread.current.object_id.to_s(36)
    $log.debug { "#{thr_id} #{zk.children('/').inspect}" }
  end
  sleep 1
end

puts "Pid: #{$$}"
count = 50
stack = []
(0..count).each do |i|
  stack << Thread.new { do_something }
end
# sleep rand(0)

stack.each do |th|
  begin
    th.join(3)
  rescue Exception => e
    puts "kill: #{e.class} #{e.message}"
  end
end

By waiting for the threads to exit before "falling off the end of the script" I can't get it to hang.

tsilen · 2012-10-09T20:03:55Z

Yes, the issue seems to happen when a thread is alive and in some specific state at the end of the script, so waiting for them to exit should also do the trick. I did some quick testing with the updated script and didn't get it to hang either, but I'll test a bit more tomorrow.

slyphon · 2012-10-09T22:04:53Z

Ok, cool. I'm like 99% sure that this is a bug in the way ruby does ConditionVariable cleanup.

tsilen · 2012-10-26T19:35:42Z

Looks like close! in at_exit indeed doesn't work in all cases. This is a quite a big problem, because for example rails (rack) processes many times hang when they should exit when released from e.g. Passenger's pool and keep eating a lot of resources. I think it's something that has to be handled within zk.

slyphon · 2012-10-26T21:15:12Z

Yeah, you need to close your connections before "falling off the end of the script" as above. You cannot rely on ruby to shut down these clients properly because it's not handling cleaning up POSIX thread primitives properly. I tested in an at_exit and it didn't work, I had to wait for all of the clients to finish closing and join all threads before reaching the end.

I'm not sure how to handle this in terms of passenger, but I'd imagine that there's some kind of shutdown you can hook into.

Again, there's really nothing I can do on the ZK end, as I don't control the main thread. The only other thing you can do is do your work inside of a ZK.open() block, which will ensure that zk is shutdown when the block returns.

mperham · 2012-12-07T16:45:57Z

Can you modify the code to use 1.9's condvar api when running in 1.9?

mperham · 2012-12-07T16:46:35Z

We are seeing sidekiq hang occasionally in production on exit and yep we're running redis_failover.

slyphon · 2012-12-07T19:28:58Z

If you close ZK before the process exits, the hang doesn't happen because
all the underlying posix stuff gets cleaned up. Clean shutdown is key.

The reason i have to use Monitor is because @eric, who's been a major
contributor is still running 1.8, and I've agreed to maintain backwards
compatibility. 1.8's ConditionVariable doesn't have a timeout option, which
makes it really difficult.

On Fri, Dec 7, 2012 at 11:46 AM, Mike Perham notifications@git.luolix.topwrote:

We are seeing sidekiq hang occasionally in production on exit and yep
we're running redis_failover.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/50#issuecomment-11136886.

eric · 2012-12-07T19:35:44Z

Looking at the 1.9 code for monitor, it looks like it's just using a standard ConditionVariable and Mutex under the covers. There shouldn't be any real difference in how it acts on 1.9.

slyphon · 2012-12-07T20:18:22Z

Except that ConditionVariable and Mutex in 1.9 have real posix thread
primitives backing them, and 1.9 doesn't clean up effectively before
exiting, so it deadlocks.

On Fri, Dec 7, 2012 at 2:35 PM, Eric Lindvall notifications@git.luolix.topwrote:

Looking at the 1.9 code for monitor, it looks like it's just using a
standard ConditionVariable and Mutex under the covers. There shouldn't be
any real difference in how it acts on 1.9.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/50#issuecomment-11143151.

mperham · 2012-12-07T20:35:14Z

If you are suggesting there's a bug in Ruby 1.9, that's quite possible. I found a bug in CondVar last year, made a reproducible test case and had it fixed in head the next morning.

eric · 2012-12-07T20:38:56Z

Yes, the consensus is that there is a bug in the 1.9 Mutex/ConditionVariable implementation that causes the VM to hang if there are threads that are waiting on a ConditionVariable when the process exits.

Does sidekiq do graceful shutdowns? If so, it's likely you could work around this by calling close! on the ZK object before exiting.

mperham · 2012-12-07T20:46:59Z

Where's the redmine ticket for the bug?

Sidekiq knows nothing about ZK. It's using the redis client, which is wrapped by the redis_failover client. Sidekiq does not bother to close sockets / clients before shutting down and redis_failover does not provide a close/shutdown operation.

eric · 2012-12-07T20:54:53Z

Someone will have to pipe up if they've isolated the problem into a reproducible testcase, but from reading the thread you and @tsilen have described it as an "occasional" issue, which would lead me to believe neither of you have isolated it.

We're all volunteers here. I run ruby 1.8, and haven't run into this specific issue, but hopefully one of the self-motivated people on this thread who does run 1.9 and have been impacted by this issue could find the time to isolate it and submit a redmine ticket.

In the mean time, it may be worth investigating providing a mechanism in sidekiq for users of the library to register callbacks for performing a clean shutdown which would allow a user to perform a graceful shutdown of the redis_failover or zk client.

ryanlecompte · 2012-12-07T20:57:07Z

FYI, redis_failover does provide the RedisFailover::Client#shutdown method
which closes the ZK connection and closes the underlying Redis clients.

On Fri, Dec 7, 2012 at 12:54 PM, Eric Lindvall notifications@git.luolix.topwrote:

Someone will have to pipe up if they've isolated the problem into a
reproducible testcase, but from reading the thread you and @tsilenhttps://github.com/tsilenhave described it as an "occasional" issue, which would lead me to believe
neither of you have isolated it.

We're all volunteers here. I run ruby 1.8, and haven't run into this
specific issue, but hopefully one of the self-motivated people on this
thread who does run 1.9 and have been impacted by this issue could find the
time to isolate it and submit a redmine ticket.

In the mean time, it may be worth investigating providing a mechanism in
sidekiq for users of the library to register callbacks for performing a
clean shutdown which would allow a user to perform a graceful shutdown of
the redis_failover or zk client.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/50#issuecomment-11145976.

slyphon · 2012-12-07T20:58:28Z

"Where's the redmine ticket for the bug?"

You're kidding.

On Fri, Dec 7, 2012 at 3:47 PM, Mike Perham notifications@git.luolix.topwrote:

Where's the redmine ticket for the bug?

Sidekiq knows nothing about ZK. It's using the redis client, which is
wrapped by the redis_failover client. Sidekiq does not bother to close
sockets / clients before shutting down and redis_failover does not provide
a close/shutdown operation.

—
Reply to this email directly or view it on GitHubhttps://github.com//issues/50#issuecomment-11145692.

mperham · 2012-12-07T21:00:10Z

Ryan, FYI it's not listed here which is linked from your readme: http://rubydoc.info/github/ryanlecompte/redis_failover/master/RedisFailover/Client

tsilen mentioned this issue Oct 8, 2012

Hanging rake task due to lock contention when using multiple redis failover client connections in a threadpool ryanlecompte/redis_failover#38

Closed

cheald mentioned this issue Dec 7, 2012

sidekiq randomly(?) hangs. sidekiq/sidekiq#547

Closed

arohter mentioned this issue Jan 8, 2014

Connection handling improvements. ryanlecompte/redis_failover#66

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Thread hangs on exit #50

Thread hangs on exit #50

tsilen commented Sep 25, 2012

tsilen commented Sep 26, 2012

slyphon commented Sep 26, 2012

ryanlecompte commented Oct 8, 2012

maxjustus commented Oct 8, 2012

slyphon commented Oct 8, 2012

slyphon commented Oct 8, 2012

ryanlecompte commented Oct 8, 2012

ryanlecompte commented Oct 8, 2012

slyphon commented Oct 8, 2012

slyphon commented Oct 8, 2012

tsilen commented Oct 9, 2012

slyphon commented Oct 9, 2012

tsilen commented Oct 26, 2012

slyphon commented Oct 26, 2012

mperham commented Dec 7, 2012

mperham commented Dec 7, 2012

slyphon commented Dec 7, 2012

eric commented Dec 7, 2012

slyphon commented Dec 7, 2012

mperham commented Dec 7, 2012

eric commented Dec 7, 2012

mperham commented Dec 7, 2012

eric commented Dec 7, 2012

ryanlecompte commented Dec 7, 2012

slyphon commented Dec 7, 2012

mperham commented Dec 7, 2012

Thread hangs on exit #50

Thread hangs on exit #50

Comments

tsilen commented Sep 25, 2012

tsilen commented Sep 26, 2012

slyphon commented Sep 26, 2012

ryanlecompte commented Oct 8, 2012

maxjustus commented Oct 8, 2012

slyphon commented Oct 8, 2012

slyphon commented Oct 8, 2012

ryanlecompte commented Oct 8, 2012

ryanlecompte commented Oct 8, 2012

slyphon commented Oct 8, 2012

slyphon commented Oct 8, 2012

tsilen commented Oct 9, 2012

slyphon commented Oct 9, 2012

tsilen commented Oct 26, 2012

slyphon commented Oct 26, 2012

mperham commented Dec 7, 2012

mperham commented Dec 7, 2012

slyphon commented Dec 7, 2012

eric commented Dec 7, 2012

slyphon commented Dec 7, 2012

mperham commented Dec 7, 2012

eric commented Dec 7, 2012

mperham commented Dec 7, 2012

eric commented Dec 7, 2012

ryanlecompte commented Dec 7, 2012

slyphon commented Dec 7, 2012

mperham commented Dec 7, 2012