add shutdown controller to force exit on stalled shutdown #4051
Conversation
@colinsurprenant what do you think about this? I had this branch bit-rotting in my local git.
@@ -50,6 +50,11 @@ class LogStash::Agent < Clamp::Command
     I18n.t("logstash.agent.flag.configtest"),
     :attribute_name => :config_test

+  option "--[no-]force-shutdown", :flag,
not sure the flag really indicates the behavior. Something like --allow-unsafe-shutdown?
+1
I would personally prefer --unsafe-shutdown. It makes it clear that it's a risk.
+1 for --allow-unsafe-shutdown
I did not look into the details of the PR but does having a
  logger.warn("dead letter received!", :event => event.to_hash)
  event.tag("_dead_letter")
  event.cancel
  @destination << event
What's the thread-safety here? Multiple plugins could be doing this simultaneously. It's possible @file.puts is not threadsafe (further down in your File destination), so we might want to protect this with some kind of lock.
updated by surrounding this code with Thread.exclusive
Regarding @colinsurprenant's comment, I agree with the concept of this, but I also sit with Colin here in that maybe the flag shouldn't even be available. Ideally we'd always have this dead lettering available, so that an instant shutdown is always possible thanks to the dead lettering functionality. Thoughts?
+1 on all of @colinsurprenant's comments. The flag comes from my uncertainty about the final version of this feature. With the upcoming persistence work, I'm not sure how this DeadLetterPostOffice stuff will survive, even though the stall detection should remain the same.
Force-pushed from f48687c to 9673e45
@jsvd ok, makes sense. Maybe we could wait a bit to see how persistence shapes up and see how we can/will hook that up here? I am definitely not against moving forward with an experimental feature, it will give us a better understanding of the problem space; OTOH we know this will tie in with persistence, so waiting a bit on this part would avoid redundant work. But OTOOH :P I think that having a simple json event dump could still be a good choice regardless of persistence: it is simple and provides a very practical way for users to inspect/manipulate these events.

On a side note: isn't a Dead Letter (Queue) normally specific to "poison" or unsuccessful events? In the context of a stalled shutdown, the pending events in the queues are not poison events, just normal events that we need to flush to disk to avoid losing them. Maybe we should change the naming here?
According to wikipedia: "Dead letter mail or undeliverable mail is mail that cannot be delivered to the addressee or returned to the sender." Further: I think it fits :)

EDIT: as for the OTO+H, I see this as something we can add in 2.1, suggest to any user that has "stall" problems to enable it, see how it works, and then make it default/change it/remove it for 2.2. Having short release cycles enables these experiments, but I understand it can look a bit amateur.
📨 📮 📫
  end

  def self.post(event)
    Thread.exclusive do
any specific reason for using Thread.exclusive here instead of a specific Mutex?
not really, this is a single-purpose synchronisation point, where all threads access it in the same manner by doing a post on a singleton, so I didn't see a reason to create a specific instance, but I can be convinced otherwise.
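For context, a minimal sketch of the Mutex alternative being weighed here; the names (DeadLetterPostOffice, @destination) follow the snippets quoted in this thread, but the surrounding structure is an assumption, not this PR's actual code:

```ruby
require "thread"

# Sketch only: module layout and @destination are assumed for illustration.
module DeadLetterPostOffice
  MUTEX = Mutex.new
  @destination = []

  def self.post(event)
    # A dedicated Mutex serializes only dead-letter posts; Thread.exclusive
    # instead takes a single process-wide lock shared with every other
    # Thread.exclusive caller anywhere in the process.
    MUTEX.synchronize { @destination << event }
  end
end

DeadLetterPostOffice.post("message" => "example")
```

(Thread.exclusive was later deprecated in Ruby, which in hindsight also favors the dedicated Mutex.)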
@jordansissel (and anyone else): I would like some advice on the dead letter destination. My thought is that dumping queues should (for now) be forcefully done to a file on disk in json format, which is what this PR does. My idea is to change this PR's default to write the file as something like:
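For illustration, a minimal hedged sketch of dumping a queue to a timestamped json file on disk, as described above; the path pattern and event serialization below are assumptions, not this PR's code:

```ruby
require "json"
require "thread"

# Sketch only: path pattern and event#to_json are assumed.
def dump_queue_to_disk(queue, path = "data/dead_letters_#{Time.now.to_i}.json")
  File.open(path, "a") do |file|
    loop do
      event = queue.pop(true) rescue break  # non-blocking pop; stop once drained
      file.puts(event.to_json)              # one JSON document per line
    end
  end
  path
end

# usage: dump_queue_to_disk(SizedQueue.new(20).tap { |q| q << { "message" => "hi" } })
```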
I'd like to say I agree with @colinsurprenant that we shouldn't send in-flight stuff to the DLQ just because of a shutdown. It should only be for poisonous or permanently failing events. Once persistence is implemented, LogStash should be capable of a crash-only shutdown, which tackles that problem in a much cleaner way than having to respool selective events.

WRT the DLQ destination, the one design concern I'd like to see addressed is capping the DLQ size. Given that, I'd prefer to just append to a single file vs. writing multiple files for each run. If you regularly restart logstash (say via a cron job), it could be easy to blow past any per-file limit. I'm not sure how we'd accomplish the file locking here (probably a .dlq-lock file containing the PID?).
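On the locking question, a hedged sketch of the .dlq-lock idea floated above; the flock usage and paths here are assumptions, not a settled design:

```ruby
require "fileutils"

# Sketch only: take an exclusive lock on a sidecar file and record our PID,
# so a second process appending to the same DLQ file fails fast.
def with_dlq_lock(dir = "data")
  FileUtils.mkdir_p(dir)
  File.open(File.join(dir, ".dlq-lock"), File::RDWR | File::CREAT) do |lock|
    unless lock.flock(File::LOCK_EX | File::LOCK_NB)
      raise "DLQ is locked by another process (see #{lock.path})"
    end
    lock.truncate(0)
    lock.write(Process.pid.to_s)
    lock.flush
    yield
  end # lock released when the file handle closes
end

# usage:
# with_dlq_lock { File.open("data/dead_letters.json", "a") { |f| f.puts('{}') } }
```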
@andrewvc agreed on all points in the long term, when eyeballing the next major version of logstash with persistence. However, until then, we need to dump the SizedQueues to some place and force the shutdown. In my opinion a lot of this code will probably be deleted once we're able to safely shut down logstash thanks to persistence. I may have done the wrong thing here in associating this with dead letters; I'm considering finding another name for this.
@jsvd why will only the input_to_filter queue be dropped? Won't we lose in-flight events in filter_to_output (to say nothing of the internal buffer in elasticsearch.rb)? IMHO if we really want this feature we should do the full thing without caveats. Dropping stuff in the Elasticsearch queue is not acceptable.

I think our biggest roadblock right now is the java core rewrite, and if persistence is that important we can tackle it in ruby sooner rather than later. I think it'd be faster to just implement a persistent queue in ruby along the guidelines I proposed before (instead of waiting for a java core rewrite), and let it be a configurable setting for now (for those who don't want to eat the perf cost). In fact, I already implemented a rough PoC of that in ~1 day a few months ago, with a tolerable performance cost (and that was using Elasticsearch as a persistence backend).
@andrewvc we already have a persistent queue implementation that is not based on the core-event java rewrite, and we decided to move on with the core event java rewrite. If we are to move in that direction, as we seem to agree we are, then as you suggest we should probably focus on that instead of side-tracking on yet another implementation strategy, no? This sounds to me like going full circle...
@@ -175,7 +179,8 @@ def start_input(plugin)
   end
 
   def inputworker(plugin)
-    LogStash::Util::set_thread_name("<#{plugin.class.config_name}")
+    LogStash::Util.set_thread_name("<#{plugin.class.config_name}")
+    LogStash::Util.set_thread_plugin(plugin)
Love having this around for troubleshooting!
After discussing with @jsvd we agreed to move the ShutdownController initialization outside of the pipeline; the pipeline will be passed to the controller. This will make the code easier to follow and more extensible.
  # * at least REPORT_EVERY reports have been created
  # * the inflight event count is monotonically increasing
  # * there are worker threads running which aren't blocked on SizedQueue pop/push
  # * the stalled thread list is constant in the previous REPORT_EVERY reports
+1 for the comment
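For anyone following along, a rough, self-contained illustration of the heuristic listed in that comment; the report shape and names below are assumptions for illustration, not this PR's code:

```ruby
# Rough illustration only; the report shape is assumed, e.g.:
#   { :inflight_count => 42, :stalled_threads => ["<generator", "|ruby"] }
REPORT_EVERY = 5

def stalled?(reports)
  return false if reports.size < REPORT_EVERY  # need enough samples first
  window = reports.last(REPORT_EVERY)
  counts = window.map { |r| r[:inflight_count] }
  # the inflight count never goes down across the window...
  not_draining = counts.each_cons(2).all? { |a, b| b >= a }
  # ...and the same set of threads stays stalled the whole time
  same_threads = window.map { |r| r[:stalled_threads] }.uniq.size == 1
  not_draining && same_threads
end

reports = Array.new(5) { { :inflight_count => 10, :stalled_threads => ["|ruby"] } }
stalled?(reports) # => true
```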
code looks good, will do some manual testing.
Maybe add the
Manual tests run successfully:
* Stuck pipeline.
* Normal healthy pipeline.
@jsvd I have removed the LGTM, the tests are failing on this branch.
failing tests:
* start logstash with --allow-unsafe-shutdown to force_exit on stalled shutdown
* by default --allow-unsafe-shutdown is disabled
* stall detection kicks in when SIGTERM/SIGINT is received
* check if inflight event count isn't going down and if there are blocked/blocking plugin threads
Force-pushed from 1f2f8d9 to 9e76fa7
fixes #3451
NOTE TO REVIEWERS: I'm still working on the tests for the ShutdownController
To experiment with the shutdown controller, use pipelines like:

bin/logstash -e "input { generator { count => 10000 } } filter { ruby { code => 'sleep 5' } } output { tcp { port => 3333 host => localhost workers => 2 } }" -w 2 --force-shutdown

or

bin/logstash -e 'input { generator { } } filter { ruby { code => "sleep 10000" } } output { stdout { codec => dots } }' -w 1 --force-shutdown

With and without "--force-shutdown".
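To trigger the stall detection in these experiments, send SIGTERM or SIGINT (e.g. ctrl-c) to the running process, per the commit message above. As a hedged sketch of how that wiring can look in Ruby (the method name here is an assumption, not this PR's API):

```ruby
# Sketch only: start_shutdown_watcher is an assumed name.
def start_shutdown_watcher
  Thread.new do
    # Periodically sample in-flight counts and thread states, applying a
    # stall heuristic like the one sketched earlier in this thread; on a
    # confirmed stall, call exit!(1) to abandon in-flight events.
  end
end

["TERM", "INT"].each do |sig|
  # Keep the trap handler tiny: just spawn the watcher thread.
  Signal.trap(sig) { start_shutdown_watcher }
end
```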