Thread hangs on exit #50
Comments
One workaround would be calling close! in at_exit { }.
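A minimal sketch of that workaround, assuming a single ZK client created at startup (later comments in this thread report it does not help in every case):

```ruby
require 'zk'

zk = ZK.new('localhost:2181')

# Close the ZooKeeper connection in an at_exit hook so it is torn down
# before the interpreter starts tearing down threads on its own.
at_exit { zk.close! }
```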
Just wanted to let you know I got this. I'm not working today, but I'll try. — Jonathan Simms
Any update on this one?
+1
Hey guys, sorry, I've been bogged down with holidays and nonsense; I'll try to get to this one this week.
After a quick look, it seems the problem is (as always) 1.8.7 compatibility (yes, I know, talk to @eric), where only Monitor's condvar takes a timeout. I'll have to review it a little more closely. I'll try to review the places where wait() is being called and at least get a branch together that people can test against.
Thanks for taking a look! With regards to most people not having many ZK instances - that might not always be true if you're using redis_failover with a connection pool (for example, what's used in Sidekiq). Sidekiq lets you create a connection pool with e.g. 10 or 20 redis clients. Each RedisFailover::Client instance will contain an underlying ZK instance.
By the way, I just updated (in master) RedisFailover::Client so that you can pass in an existing ZK client instance. You would use it like this:

```ruby
zk = ZK.new('localhost:2181,localhost:2182,localhost:2183,localhost:2184,localhost:2185')
pool = ConnectionPool.new(:size => 20) { RedisFailover::Client.new(:zk => zk) }
pool.with { |client| client.get('foo') }
```
This is probably how it should be done, just in terms of resource usage.
Ok, so I think the test @tsilen came up with shows a bug in ruby. I've run into issues before with at-exit GC and thread primitives (in 1.8, but still). If you modify the above torture test:

```ruby
#!/usr/bin/env ruby
require 'rubygems'
require 'zk'
require 'logger'

$stdout.sync = true
$log = Logger.new($stderr).tap { |n| n.level = Logger::DEBUG }

@zkservers = "localhost:2181"

trap 'TTIN' do
  Thread.list.each do |thread|
    puts "Thread TID-#{thread.object_id.to_s(36)}"
    puts thread.backtrace.join("\n")
  end
end

def do_something
  ZK.open(@zkservers) do |zk|
    thr_id = Thread.current.object_id.to_s(36)
    $log.debug { "#{thr_id} #{zk.children('/').inspect}" }
  end
  sleep 1
end

puts "Pid: #{$$}"

count = 50
stack = []

(0..count).each do |i|
  stack << Thread.new { do_something }
end

# sleep rand(0)

stack.each do |th|
  begin
    th.join(3)
  rescue Exception => e
    puts "kill: #{e.class} #{e.message}"
  end
end
```

By waiting for the threads to exit before "falling off the end of the script" I can't get it to hang.
Yes, the issue seems to happen when a thread is alive and in some specific state at the end of the script, so waiting for them to exit should also do the trick. I did some quick testing with the updated script and didn't get it to hang either, but I'll test a bit more tomorrow.
Ok, cool. I'm like 99% sure that this is a bug in the way ruby does ConditionVariable cleanup.
Looks like close! in at_exit indeed doesn't work in all cases. This is quite a big problem, because, for example, Rails (Rack) processes often hang when they should exit after being released from e.g. Passenger's pool, and keep eating a lot of resources. I think it's something that has to be handled within zk.
Yeah, you need to close your connections before "falling off the end of the script" as above. You cannot rely on ruby to shut down these clients properly, because it doesn't clean up POSIX thread primitives properly. I tested in an at_exit and it didn't work; I had to wait for all of the clients to finish closing and join all threads before reaching the end. I'm not sure how to handle this in terms of Passenger, but I'd imagine there's some kind of shutdown you can hook into. Again, there's really nothing I can do on the ZK end, as I don't control the main thread. The only other thing you can do is do your work inside of a ZK.open() block, which will ensure that zk is shut down when the block returns.
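A minimal sketch of the block form mentioned above (the same pattern the torture test uses):

```ruby
require 'zk'

# The block form closes the connection for you when the block returns,
# so no ZK threads are left alive when the interpreter shuts down.
ZK.open('localhost:2181') do |zk|
  puts zk.children('/').inspect
end
```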
Can you modify the code to use 1.9's condvar api when running in 1.9?
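A hypothetical sketch of what such a version switch might look like; it is illustrative only and not taken from ZK's actual internals. On 1.9, ConditionVariable#wait accepts a timeout directly, while on 1.8.7 only Monitor's condition variable does.

```ruby
require 'monitor'

if RUBY_VERSION >= '1.9'
  mutex = Mutex.new
  cond  = ConditionVariable.new
  # 1.9's ConditionVariable#wait takes an optional timeout in seconds.
  mutex.synchronize { cond.wait(mutex, 5) }
else
  mon  = Monitor.new
  cond = mon.new_cond
  # On 1.8.7, only Monitor's condition variable supports a timeout.
  mon.synchronize { cond.wait(5) }
end
```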
We are seeing sidekiq hang occasionally in production on exit and yep we're running redis_failover.
If you close ZK before the process exits, the hang doesn't happen because … The reason I have to use Monitor is because @eric, who's been a major …
Looking at the 1.9 code for monitor, it looks like it's just using a standard ConditionVariable.
Except that ConditionVariable and Mutex in 1.9 have real posix thread primitives underneath.
If you are suggesting there's a bug in Ruby 1.9, that's quite possible. I found a bug in CondVar last year, made a reproducible test case and had it fixed in head the next morning.
Yes, the consensus is that there is a bug in the 1.9 ConditionVariable cleanup. Does sidekiq do graceful shutdowns? If so, it's likely you could work around this by calling close! on the ZK client during shutdown.
Where's the redmine ticket for the bug? Sidekiq knows nothing about ZK. It's using the redis client, which is wrapped by the redis_failover client. Sidekiq does not bother to close sockets / clients before shutting down and redis_failover does not provide a close/shutdown operation.
Someone will have to pipe up if they've isolated the problem into a reproducible test case, but from reading the thread, you and @tsilen have described it as an "occasional" issue, which would lead me to believe neither of you has isolated it. We're all volunteers here. I run ruby 1.8 and haven't run into this specific issue, but hopefully one of the self-motivated people on this thread who does run 1.9 and has been impacted by this issue could find the time to isolate it and submit a redmine ticket. In the meantime, it may be worth investigating providing a mechanism in sidekiq for users of the library to register callbacks for performing a clean shutdown, which would allow a user to perform a graceful shutdown of the redis_failover or zk client.
FYI, redis_failover does provide the RedisFailover::Client#shutdown method.
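A hedged sketch of an explicit, graceful shutdown using only the pieces mentioned in this thread (a shared ZK connection, RedisFailover::Client#shutdown, and ZK#close!); the setup mirrors the pool example earlier and is illustrative rather than a recommended Sidekiq integration:

```ruby
require 'zk'
require 'redis_failover'

zk = ZK.new('localhost:2181')
client = RedisFailover::Client.new(:zk => zk)

begin
  client.get('foo')
ensure
  # Tear everything down explicitly before the process exits instead of
  # relying on at_exit, which this thread reports is not always enough.
  client.shutdown
  zk.close!
end
```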
"Where's the redmine ticket for the bug?" You're kidding. On Fri, Dec 7, 2012 at 3:47 PM, Mike Perham notifications@git.luolix.topwrote:
|
Ryan, FYI it's not listed here (which is linked from your readme): http://rubydoc.info/github/ryanlecompte/redis_failover/master/RedisFailover/Client
Recently we've run into an issue where a process using zk and redis_failover sometimes fails to die on exit, and instead hangs in an infinite sched_yield() loop, taking all CPU. A SIGKILL is required to get rid of it. I guess it has to do with the program exiting without properly closing the connection first, but it should still die cleanly.
gdb output gives this:
After adding some debug code to the ruby side to get a backtrace when the process is hung, I was able to get this:
When trying to reproduce it without redis_failover I was able to get it to hang in a similar way, but in a different place:
and
The code in all of these places is somewhat similar (a @cond.wait).
Any ideas? Ruby is 1.9.3-p194, but it also happens at least on 1.9.3-p0. ZK 1.7.1, zookeeper 1.3.0. OS: Linux, Ubuntu 12.04.
At least in Ruby 1.9.3 you can give a timeout to ConditionVariable's wait(); maybe that would help.
With this I was able to get it to hang in a similar way quite often:
Running it in a while true; do ./test.rb; done loop until it gets stuck, and then kill -TTIN prints the backtraces of the still-alive threads.
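A minimal sketch of the timeout idea mentioned above (nothing here is taken from ZK's internals): wait with a timeout in a loop and re-check a predicate, so a missed wakeup at shutdown can't block forever.

```ruby
require 'thread'

mutex = Mutex.new
cond  = ConditionVariable.new
done  = false

waiter = Thread.new do
  mutex.synchronize do
    # Re-check the predicate after every timed wait instead of blocking
    # indefinitely; 1.9.3's wait accepts a timeout in seconds.
    cond.wait(mutex, 1) until done
  end
end

mutex.synchronize { done = true; cond.signal }
waiter.join
```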