[BUG] RESQUE_PRE_SHUTDOWN_TIMEOUT not working at all on Heroku #1559
@Erowlin I think I figured out the problem here:

This monkey patch looks to solve the issue for me:

```ruby
Resque::Worker.class_eval do
  def unregister_signal_handlers
    trap('TERM') do
      trap('TERM') do
        puts "[resque-override] received second term signal from parent, throwing termexception"
        trap('TERM') do
          puts "[resque-override] ignoring > 2 term signal"
        end
        raise Resque::TermException.new("SIGTERM")
      end
      puts "[resque-override] received first term signal from heroku, ignoring"
    end

    trap('INT', 'DEFAULT')

    begin
      trap('QUIT', 'DEFAULT')
      trap('USR1', 'DEFAULT')
      trap('USR2', 'DEFAULT')
    rescue ArgumentError
    end
  end
end
```

I'm not sure how to make this into a PR without implementing a heroku-specific flag. Most other non-heroku platforms/devops architectures only signal the parent process. I've asked heroku support if there's a way for them to respect the process hierarchy and only signal the parent process. I'll let you know if I find anything.
Thanks for the info, @iloveitaly. Hopefully heroku can resolve this on their end.
@chrisccerami My hunch is they don't have any options to change how their infrastructure handles this. I'll let you know when I hear back. Any idea how to best handle this case? I'm thinking a heroku flag would be the only way to cleanly handle this, right?
Thanks for your answer, @iloveitaly. It makes sense actually. However, I'm not sure we can do much on our side. Should we try to sync with the Heroku team and see what our options are?
@chrisccerami @Erowlin Just heard back from heroku support: there's no way to change how their platform signals processes. What do you guys think about some sort of feature flag to handle this case? Signals don't have any metadata, so I don't think there's a way to know where the signal is coming from. Also, I've updated the above snippet with some improved log statements and a namespace bug fix.
I'm not sure I get your point about the feature flag. Could you provide an example, @iloveitaly?
I believe he means passing in a flag if you're using heroku, so that Resque can handle shutdown the way heroku expects, without interfering with other platforms.
@chrisccerami that's exactly right: it looks like we need a heroku-specific codepath to adjust how signals are handled when the app is hosted on heroku.
Ah, got it, that makes sense. It also means we have to implement specific logic within the Resque code to properly handle the Heroku behaviour, or does it already exist in the Resque codebase? Update: I believe we have to use your monkey patch until a fix is merged?
That's right! @chrisccerami I know there aren't platform-specific feature flags right now in the resque codebase. Any opinions on the best way to implement this?
I'll have to put some thought into this (perhaps next week when I have a week off), but I'm open to ideas about how we can implement this in a way that doesn't make the code too messy for the generic case.
Sounds good! I hope you don't spend time on this on your week off though!
What about implementing the heroku-specific code in a plugin? Just a random thought.
I like this a lot! The downside would be having to monkeypatch Resque core. Pulling out platform-specific flags into a separate gem does seem to make the most sense to me. I can go ahead and do this unless anyone else has other thoughts!
I'm happy with that solution.
Hey all! I've created a gem to wrap this patch:
It turns out Heroku sends kill signals directly to all child processes of a Resque worker, which Resque is not designed to handle. That causes jobs to get killed without getting logged as killed, so they never get retried later. I'm pretty sure this is one of the causes of the hanging import jobs I'm seeing at Internet Archive. See this GitHub issue for more about why the gem exists: resque/resque#1559
I'm curious if anyone here has experienced this same issue with Amazon ECS; I'm having a hard time rescuing
I'm currently debugging the same issue. So far it seems like the TermException, which should get raised in the trap context of the forked child, is not passed up to the parent. I was able to reproduce this by starting a resque worker inside a Docker container and then sending a TERM signal to the container: I'd expect a TermException to be raised, but it is not. When trying to raise it via my own trap handler,
I get the first log entry, but the exception is never raised and, of course, the second log entry never appears.
On a superficial read, doesn't the undocumented I mean, provided (You would lose Resque's child termination logic.)
I believe the solution implemented by https://github.com/resque/resque-heroku-signals is fundamentally flawed, because it ignores the documented caveat that processes can get
The assumption in that gem is that the second
I am thinking about alternative approaches.
Hi,
I'm coming back again about the RESQUE_PRE_SHUTDOWN_TIMEOUT env variable: after doing some tests on a staging environment, it's not working at all. Here is the configuration I've made:
```
# Procfile
one_worker_and_scheduler: WORKER_COUNT=1 bin/start-stunnel bin/worker_and_scheduler
```

```
# Heroku environment variables
RESQUE_PRE_SHUTDOWN_TIMEOUT=6
RESQUE_TERM_TIMEOUT=6
TERM_CHILD=1
```
Then I do this in another terminal tab:
```
heroku restart --remote staging
```
And in my worker logs, I can see this:
As you can see, there is no delay between
and the job actually being killed completely.
When I read this: https://github.com/resque/resque#signals-on-heroku
it seems that the delay should always be < 10 seconds (the time after which Heroku sends a SIGKILL).
Am I doing something wrong, or does it not work?
Happy to give more information, as I believe it's quite critical for a bunch of people out there.
Thank you!