
add shutdown controller to force exit on stalled shutdown #4051

Closed
wants to merge 1 commit from abort_on_stalled_shutdown

Conversation

jsvd
Member

@jsvd jsvd commented Oct 16, 2015

  • start logstash with --force-shutdown to force exit on stalled shutdown
  • stall detection kicks in when SIGTERM/SIGINT is received
  • check if inflight event count isn't going down and if there are blocked/blocking plugin threads
  • abort logstash if a stall is detected and --force-shutdown is enabled

fixes #3451

NOTE TO REVIEWERS: I'm still working on the tests for the ShutdownController

To experiment with the shutdown controller use pipelines like
bin/logstash -e "input { generator { count => 10000 } } filter { ruby { code => 'sleep 5' } } output { tcp { port => 3333 host => localhost workers => 2} }" -w 2 --force-shutdown

or
bin/logstash -e 'input { generator { } } filter { ruby { code => "sleep 10000" } } output { stdout { codec => dots } }' -w 1 --force-shutdown

With and without "--force-shutdown"

@jsvd
Member Author

jsvd commented Oct 16, 2015

@colinsurprenant what do you think about this? I had this branch getting bit rot in my local git

@@ -50,6 +50,11 @@ class LogStash::Agent < Clamp::Command
I18n.t("logstash.agent.flag.configtest"),
:attribute_name => :config_test

option "--[no-]force-shutdown", :flag,
Contributor

not sure the flag really indicates the behavior. Something like --allow-unsafe-shutdown?

Contributor

+1

I would personally prefer --unsafe-shutdown. It makes it clear that it's a risk.

Contributor

+1 for --allow-unsafe-shutdown

@colinsurprenant
Contributor

I did not look into the details of the PR, but does having a --force-shutdown make sense conceptually? How will a logstash user decide whether to use this or not? Or from another angle, does it even make sense to support shutdown stalls? Here by support I mean that we acknowledge that this is an acceptable and possible behaviour. Instead, shouldn't we just fix the shutdown stall problem and guarantee that when you trigger a shutdown, it will shut down (whatever that means, but ideally without data loss)?

logger.warn("dead letter received!", :event => event.to_hash)
event.tag("_dead_letter")
event.cancel
@destination << event
Contributor

What's the thread-safety here? Multiple plugins could be doing this simultaneously. It's possible @file.puts is not threadsafe (further down in your File destination), so we might want to protect this with some kind of lock.

Member Author

updated by surrounding this code with Thread.exclusive
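
For reference, a minimal sketch of the resulting shape, combining the quoted lines above (the module wrapper is an assumption; only the body mirrors the PR's code):

```ruby
# Sketch: serialize posts from concurrent plugin threads so writes to the
# shared @destination (e.g. a file) cannot interleave.
module DeadLetterPostOffice
  def self.post(event)
    Thread.exclusive do              # process-wide lock around the whole post
      event.tag("_dead_letter")      # mark the event so it can be recognized later
      event.cancel                   # stop it from flowing further through the pipeline
      @destination << event          # hand it to the configured destination
    end
  end
end
```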

@jordansissel
Contributor

Regarding @colinsurprenant's comment, I agree with the concept of this but also sit with Colin here in thinking that maybe the flag shouldn't even be available. Ideally we'd always have this dead lettering available, so that an instant shutdown is always possible thanks to the dead lettering functionality.

Thoughts?

@jsvd
Member Author

jsvd commented Oct 22, 2015

+1 on all of @colinsurprenant's comments. The flag comes from my uncertainty about the final shape of this feature. With the upcoming persistence work, I'm not sure how this DeadLetterPostOffice stuff will survive, even though the stall detection should remain the same.
Also, right now we don't have a mechanism to automatically feed the dead letters back into Logstash at startup; it's just a simple "here's how you can discard messages in a way that makes it possible to ingest them later".
This is why I'd consider making this an experimental feature, disabled by default for one or two minor versions, and then eventually remove the setting and make it on by default.

@jsvd jsvd force-pushed the abort_on_stalled_shutdown branch from f48687c to 9673e45 Compare October 22, 2015 11:38
@colinsurprenant
Contributor

@jsvd ok, makes sense. Maybe we could wait a bit to see how persistence shapes up and see how we can/will hook that up here? I am definitely not against moving forward with an experimental feature; it will give us a better understanding of the problem space. OTOH we know this will tie in with persistence, so if we wait a bit for this part it will avoid redundant work. But OTOOH :P I think that having a simple JSON event dump could still be a good choice regardless of persistence: it is simple and provides a very practical way for users to inspect/manipulate these events.

On a side note: isn't a Dead Letter (Queue) normally specific to "poison" or unsuccessful events? In the context of a stalled shutdown, the pending events in the queues are not poison events, just normal events that we need to flush to disk to avoid losing them. Maybe we should change the naming here?

@jsvd
Member Author

jsvd commented Oct 22, 2015

According to wikipedia: "Dead letter mail or undeliverable mail is mail that cannot be delivered to the addressee or returned to the sender. "

Further:
"This is usually due to lack of compliance with postal regulations, an incomplete address and return address, or the inability to forward the mail when both correspondents move before the letter can be delivered."

I think it fits :)

EDIT: as for the OTO+H, I see this as something we can add in 2.1, suggest to any user who has "stall" problems that they enable it, see how it works, and then make it the default / change it / remove it in 2.2. Having short release cycles enables these experiments, but I understand it can look a bit amateurish.

@colinsurprenant
Contributor

📨 📮 📫

end

def self.post(event)
Thread.exclusive do
Contributor

any specific reason for using Thread.exclusive here instead of a specific Mutex?

Member Author

not really; this is a single-purpose synchronisation point where all threads access it in the same manner by doing a post on a singleton, so I didn't see a reason to create a specific instance, but I can be convinced otherwise.
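
For comparison, the dedicated-Mutex variant being suggested would look roughly like this (module layout and constant name are illustrative, not the PR's code):

```ruby
require "thread"

module DeadLetterPostOffice
  MUTEX = Mutex.new   # lock owned by the post office, instead of the global Thread.exclusive

  def self.post(event)
    MUTEX.synchronize do
      # only callers of .post contend here; unrelated users of
      # Thread.exclusive elsewhere in the process are unaffected
      @destination << event
    end
  end
end
```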

@jsvd
Member Author

jsvd commented Nov 5, 2015

@jordansissel (and anyone else): I would like some advice on the dead letter destination. My thought is that dumping queues should (for now) be forced to a file on disk in JSON format, which is what this PR does.
That said, either this path is configurable or it isn't, so:
a) if it's configurable, how do we do it? is it another flag that only works if stall detection is enabled? what is the default value?
b) if it's not, what should the path be? a tmpfile? a file inside the logstash installation directory? is it always the same file name or does it vary (e.g. with time)?

My idea is to change this PR's default to write the file as something like: logstash/logs/dead_letter.#{time_when_logstash_started}.dump
This way multiple restarts will write multiple files; otherwise multiple restarts would either overwrite the existing dump or append to it, making it harder to understand which stalled events belonged to each run.
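
A rough sketch of that timestamped-path idea (the directory layout comes from the comment above; the helper itself is hypothetical):

```ruby
# One dump file per Logstash run, so repeated restarts neither overwrite
# nor append to a previous run's stalled events.
STARTED_AT = Time.now.strftime("%Y-%m-%dT%H%M%S").freeze

def dead_letter_path(logstash_home)
  File.join(logstash_home, "logs", "dead_letter.#{STARTED_AT}.dump")
end

# dead_letter_path("/opt/logstash")
# # => "/opt/logstash/logs/dead_letter.2015-11-05T103000.dump"
```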

@andrewvc
Contributor

andrewvc commented Nov 5, 2015

I'd like to say I agree with @colinsurprenant that we shouldn't send in-flight stuff to the DLQ just because of a shutdown. It should only be for poisonous or permanently failing events.

Once persistence is implemented LogStash should be capable of a crash-only shutdown, which tackles that problem in a much cleaner way than having to respool selective events.

WRT the DLQ destination, the one design concern I'd like to see addressed is capping the DLQ size. I'd prefer, given that, to just append to a single file vs. write multiple files for each run. If you regularly restart logstash (say via a cron job), it could be easy to blow past any per-file limit.

I'm not sure how we'd accomplish the file locking here (probably a .dlq-lock file containing the PID?).
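
Purely as an illustration of that idea, a .dlq-lock file could be as simple as the following (not part of this PR; names and behaviour are assumptions):

```ruby
# Record our PID in a lock file; refuse to append to the DLQ if another
# live process already holds the lock. Stale locks from dead processes
# are taken over. A sketch of the idea, not a robust implementation.
LOCK_PATH = ".dlq-lock"

def acquire_dlq_lock
  if File.exist?(LOCK_PATH)
    other_pid = File.read(LOCK_PATH).to_i
    if other_pid > 0
      begin
        Process.kill(0, other_pid)   # signal 0 only checks that the process exists
        raise "DLQ already locked by running pid #{other_pid}"
      rescue Errno::ESRCH
        # the lock holder is gone; fall through and take the lock over
      end
    end
  end
  File.write(LOCK_PATH, Process.pid.to_s)
end
```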

@jsvd
Member Author

jsvd commented Nov 5, 2015

@andrewvc agreed with all points in the long term, when eyeballing the next major version of logstash with persistence.

However, until then, we need to dump the SizedQueues to some place and force the shutdown. In my opinion a lot of this code will probably be deleted once we're able to safely shut down logstash thanks to persistence.
Another thing to note here is that this PR, for now, will only dump the input_to_filter SizedQueue, which will have at most 20 events, so file size should not be much of a concern here.

I may have done the wrong thing here by associating this with dead letters; I'm considering finding another name for it.
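
The dump step described above is essentially draining that bounded queue into the dead-letter destination; schematically (the drain loop itself is an assumption, the names follow the discussion):

```ruby
# input_to_filter is a SizedQueue (capacity 20), so this drains at most
# 20 pending events before the forced exit.
def drain(queue)
  loop do
    event = queue.pop(true)          # non-blocking pop
    DeadLetterPostOffice.post(event)
  end
rescue ThreadError
  # pop(true) raises ThreadError once the queue is empty; we're done
end
```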

@andrewvc
Contributor

andrewvc commented Nov 5, 2015

@jsvd why will only the input_to_filter queue be dumped? Won't we lose in-flight events in filter_to_output (to say nothing of the internal buffer in elasticsearch.rb)?

IMHO if we really want this feature we should do the full thing without caveats. Dropping stuff in the Elasticsearch queue is not acceptable. I think our biggest roadblock right now is the java core rewrite, and if persistence is that important we can tackle it in ruby earlier than later.

I think it'd be faster to just implement a persistent queue along the guidelines I proposed before in ruby (instead of waiting for a java core rewrite), and let it be a configurable setting for now (for those who don't want to eat the perf cost). In fact, I already implemented a rough PoC of that in ~1 day a few months ago with a not bad performance cost (and that was using Elasticsearch as a persistence backend).

@colinsurprenant
Contributor

@andrewvc we already have a persistent queue implementation that is not based on the core-event java rewrite, and we decided instead to move on with the core event java rewrite. If we are to move in that direction, as we seem to agree we are, then as you suggest we should probably focus on that instead of side-tracking onto yet another implementation strategy, no? This sounds to me like going full circle...

@ph ph self-assigned this Nov 16, 2015
@@ -175,7 +179,8 @@ def start_input(plugin)
end

def inputworker(plugin)
- LogStash::Util::set_thread_name("<#{plugin.class.config_name}")
+ LogStash::Util.set_thread_name("<#{plugin.class.config_name}")
+ LogStash::Util.set_thread_plugin(plugin)
Contributor

Love having this around for troubleshooting!

@ph
Contributor

ph commented Nov 17, 2015

After discussing with @jsvd, we agreed to move the ShutdownController initialization outside of the pipeline; the pipeline will be passed to the controller. This will make the code easier to follow and more extensible.
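
Roughly, the shape being agreed on here (everything except the ShutdownController name is an assumption about how the pieces fit together):

```ruby
# The agent constructs the controller and hands it the pipeline, rather
# than the pipeline building its own controller internally.
class ShutdownController
  def initialize(pipeline, cycle_period = 1)   # sampling period is illustrative
    @pipeline = pipeline
    @cycle_period = cycle_period
  end

  def start
    Thread.new do
      loop do
        sleep @cycle_period
        # sample inflight counts / plugin thread states from @pipeline
        # and force an exit if the shutdown looks stalled
      end
    end
  end
end

# in the agent, once SIGTERM/SIGINT is received:
# ShutdownController.new(pipeline).start if unsafe_shutdown?
```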

# * at least REPORT_EVERY reports have been created
# * the inflight event count is monotonically increasing
# * there are worker threads running which aren't blocked on SizedQueue pop/push
# * the stalled thread list is constant in the previous REPORT_EVERY reports
Contributor

+1 for the comment
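
For readers, those four conditions translate into roughly the following check (the report objects and their inflight_count / stalled_threads accessors are illustrative, not the PR's exact implementation):

```ruby
# Each report is assumed to capture, at one sampling instant, the total
# inflight event count and the list of threads busy inside plugin code
# (i.e. not merely blocked on a SizedQueue pop/push).
def stalled?(reports, report_every)
  return false if reports.size < report_every
  recent = reports.last(report_every)

  counts = recent.map(&:inflight_count)
  not_draining = counts.each_cons(2).all? { |a, b| b >= a }   # count never goes down

  stallers          = recent.map(&:stalled_threads)
  has_stallers      = !stallers.last.empty?
  constant_stallers = stallers.uniq.size == 1                 # same threads every report

  not_draining && has_stallers && constant_stallers
end
```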

@ph
Contributor

ph commented Nov 18, 2015

code looks good, will do some manual testing.

@ph
Contributor

ph commented Nov 18, 2015

Maybe add --allow-unsafe-shutdown to the documentation.

Manual tests run successfully.

Stuck pipeline.

~/e/logstash git:pr/4051 ❯❯❯ bin/logstash -e 'input { generator { } } filter { ruby { code => "sleep 10000" } } output { stdout { codec => dots } }' -w 1 --allow-unsafe-shutdown
Default settings used: Filter workers: 1
Logstash startup completed
^CSIGINT received. Shutting down the pipeline. {:level=>:warn}
Received shutdown signal, but pipeline is still waiting for in-flight events
to be processed. Sending another ^C will force quit Logstash, but this may cause
data loss. {:level=>:warn}
 {:level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"input_to_filter"=>20, "total"=>20}, "STALLING_THREADS"=>{["LogStash::Filters::Ruby", {"code"=>"sleep 10000"}]=>[{"thread_id"=>15, "name"=>"|filterworker.0", "current_call"=>"(ruby filter code):1:in `sleep'"}]}}
The shutdown process appears to be stalled due to busy or blocked plugins. Check the logs for more information. {:level=>:error}
 {:level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"input_to_filter"=>20, "total"=>20}, "STALLING_THREADS"=>{["LogStash::Filters::Ruby", {"code"=>"sleep 10000"}]=>[{"thread_id"=>15, "name"=>"|filterworker.0", "current_call"=>"(ruby filter code):1:in `sleep'"}]}}
 {:level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"input_to_filter"=>20, "total"=>20}, "STALLING_THREADS"=>{["LogStash::Filters::Ruby", {"code"=>"sleep 10000"}]=>[{"thread_id"=>15, "name"=>"|filterworker.0", "current_call"=>"(ruby filter code):1:in `sleep'"}]}}
Forcefully quitting logstash.. {:level=>:fatal}

Normal healthy pipeline.

~/e/logstash git:pr/4051 ❯❯❯ bin/logstash -e 'input { generator { } } output { stdout { codec => dots } }' -w 1 --allow-unsafe-shutdown
Default settings used: Filter workers: 1
............................................................................................Logstash startup completed
.................................................................................................................................................................SIGINT received. Shutting down the pipeline. {:level=>:warn}

@ph
Contributor

ph commented Nov 18, 2015

@jsvd I have removed the LGTM, the tests are failing on this branch.
Just fix them and it's LGTM.

@ph ph closed this Nov 18, 2015
@ph ph reopened this Nov 18, 2015
@ph
Contributor

ph commented Nov 18, 2015

failing tests:


  1) LogStash::ShutdownController when unsafe_shutdown is true with a non-stalled pipeline should request more than NUM_REPORTS "inflight_count"
     Failure/Error: LogStash::ShutdownController::REPORTS.clear
     NameError:
       uninitialized constant LogStash::ShutdownController::REPORTS
     # ./spec/core/shutdown_controller_spec.rb:12:in `(root)'
     # ./vendor/bundle/jruby/1.9/gems/rspec-wait-0.0.7/lib/rspec/wait.rb:46:in `(root)'
     # ./lib/bootstrap/rspec.rb:11:in `(root)'

  2) LogStash::ShutdownController when unsafe_shutdown is true with a non-stalled pipeline shouldn't force exit after NUM_REPORTS cycles
     Failure/Error: LogStash::ShutdownController::REPORTS.clear
     NameError:
       uninitialized constant LogStash::ShutdownController::REPORTS
     # ./spec/core/shutdown_controller_spec.rb:12:in `(root)'
     # ./vendor/bundle/jruby/1.9/gems/rspec-wait-0.0.7/lib/rspec/wait.rb:46:in `(root)'
     # ./lib/bootstrap/rspec.rb:11:in `(root)'

  3) LogStash::ShutdownController when unsafe_shutdown is true with a stalled pipeline should force exit after NUM_REPORTS cycles
     Failure/Error: LogStash::ShutdownController::REPORTS.clear
     NameError:
       uninitialized constant LogStash::ShutdownController::REPORTS
     # ./spec/core/shutdown_controller_spec.rb:12:in `(root)'
     # ./vendor/bundle/jruby/1.9/gems/rspec-wait-0.0.7/lib/rspec/wait.rb:46:in `(root)'
     # ./lib/bootstrap/rspec.rb:11:in `(root)'

* start logstash with --allow-unsafe-shutdown to force_exit on stalled shutdown
* by default --allow-unsafe-shutdown is disabled
* stall detection kicks in when SIGTERM/SIGINT is received
* check if inflight event count isn't going down and if there are blocked/blocking plugin threads
@jsvd jsvd force-pushed the abort_on_stalled_shutdown branch from 1f2f8d9 to 9e76fa7 Compare November 19, 2015 15:30
@jsvd
Member Author

jsvd commented Nov 19, 2015

merged to:
2.1: 1412885
2.x: 79b90b1
master: 5db1d28

@jsvd jsvd closed this Nov 19, 2015
@andrewvc andrewvc mentioned this pull request Nov 23, 2015

Successfully merging this pull request may close these issues.

stalled outputs will prevent proper shutdown
6 participants