
add shutdown controller to force exit on stalled shutdown #4051

Closed
wants to merge 1 commit from abort_on_stalled_shutdown

Conversation

jsvd
Member

@jsvd jsvd commented Oct 16, 2015

  • start logstash with --force-shutdown to force exit on stalled shutdown
  • stall detection kicks in when SIGTERM/SIGINT is received
  • check if inflight event count isn't going down and if there are blocked/blocking plugin threads
  • abort logstash if a stall is detected and --force-shutdown is enabled

fixes #3451

NOTE TO REVIEWERS: I'm still working on the tests for the ShutdownController

To experiment with the shutdown controller use pipelines like
bin/logstash -e "input { generator { count => 10000 } } filter { ruby { code => 'sleep 5' } } output { tcp { port => 3333 host => localhost workers => 2} }" -w 2 --force-shutdown

or
bin/logstash -e 'input { generator { } } filter { ruby { code => "sleep 10000" } } output { stdout { codec => dots } }' -w 1 --force-shutdown

With and without "--force-shutdown"

@jsvd
Member Author

jsvd commented Oct 16, 2015

@colinsurprenant what do you think about this? I had this branch getting bit rot in my local git

@@ -50,6 +50,11 @@ class LogStash::Agent < Clamp::Command
I18n.t("logstash.agent.flag.configtest"),
:attribute_name => :config_test

option "--[no-]force-shutdown", :flag,
Contributor

not sure the flag really indicates the behavior. Something like --allow-unsafe-shutdown?

Contributor

+1

I would personally prefer --unsafe-shutdown. It makes it clear that it's a risk.

Contributor

+1 for --allow-unsafe-shutdown

@colinsurprenant
Contributor

I did not look into the details of the PR, but does having a --force-shutdown make sense conceptually? How will a logstash user decide whether to use this or not? Or from another angle, does it even make sense to support shutdown stalls? Here by support I mean that we acknowledge that this is an acceptable and possible behaviour. Instead, shouldn't we just fix the shutdown stall problem and guarantee that when you trigger a shutdown, it will shut down (whatever that means, but ideally without data loss)?

logger.warn("dead letter received!", :event => event.to_hash)
event.tag("_dead_letter")
event.cancel
@destination << event
Contributor

What's the thread-safety here? Multiple plugins could be doing this simultaneously. It's possible @file.puts is not threadsafe (further down in your File destination), so we might want to protect this with some kind of lock.

Member Author

updated by surrounding this code with Thread.exclusive
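
For reference, a minimal sketch of the resulting shape, combining the quoted lines above (the module wrapper is an assumption; only the body mirrors the PR's code):

```ruby
# Sketch: serialize posts from concurrent plugin threads so writes to the
# shared @destination (e.g. a file) cannot interleave.
module DeadLetterPostOffice
  def self.post(event)
    Thread.exclusive do              # process-wide lock around the whole post
      event.tag("_dead_letter")      # mark the event so it can be recognized later
      event.cancel                   # stop it from flowing further through the pipeline
      @destination << event          # hand it to the configured destination
    end
  end
end
```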

@jordansissel
Contributor

Regarding @colinsurprenant's comment, I agree with the concept of this but also sit with Colin here in thinking that maybe the flag shouldn't even be available. Ideally we'd always have this dead lettering available, so that an instant shutdown is always possible thanks to the dead lettering functionality.

Thoughts?

@jsvd
Member Author

jsvd commented Oct 22, 2015

+1 on all of @colinsurprenant's comments. The flag comes from my uncertainty about the final shape of this feature. With the upcoming persistence work, I'm not sure how this DeadLetterPostOffice stuff will survive, even though the stall detection should remain the same.
Also, right now we don't have a mechanism to automatically feed the dead letters back into Logstash at startup; it's just a simple "here's how you can discard messages in a way that makes it possible to ingest them later".
This is why I'd consider making this an experimental feature, disabled by default for one or two minor versions, and then eventually remove the setting and make it on by default.

@jsvd jsvd force-pushed the abort_on_stalled_shutdown branch from f48687c to 9673e45 Compare October 22, 2015 11:38
@colinsurprenant
Contributor

@jsvd ok, makes sense. Maybe we could wait a bit to see how persistence shapes up and see how we can/will hook that up here? I am definitely not against moving forward with an experimental feature; it will give us a better understanding of the problem space. OTOH we know this will tie in with persistence, so if we wait a bit for this part it will avoid redundant work. But OTOOH :P I think that having a simple JSON event dump could still be a good choice regardless of persistence: it is simple and provides a very practical way for users to inspect/manipulate these events.

On a side note: isn't a Dead Letter (Queue) normally specific to "poison" or unsuccessful events? In the context of a stalled shutdown, the pending events in the queues are not poison events, just normal events that we need to flush to disk to avoid losing them. Maybe we should change the naming here?

@jsvd
Member Author

jsvd commented Oct 22, 2015

According to wikipedia: "Dead letter mail or undeliverable mail is mail that cannot be delivered to the addressee or returned to the sender. "

Further:
"This is usually due to lack of compliance with postal regulations, an incomplete address and return address, or the inability to forward the mail when both correspondents move before the letter can be delivered."

I think it fits :)

EDIT: as for the OTO+H, I see this as something we can add in 2.1, suggest to any user who has "stall" problems that they enable it, see how it works, and then make it the default / change it / remove it in 2.2. Having short release cycles enables these experiments, but I understand it can look a bit amateurish.

@colinsurprenant
Contributor

📨 📮 📫

end

def self.post(event)
Thread.exclusive do
Contributor

any specific reason for using Thread.exclusive here instead of a specific Mutex?

Member Author

not really; this is a single-purpose synchronisation point where all threads access it in the same manner by doing a post on a singleton, so I didn't see a reason to create a specific instance, but I can be convinced otherwise.
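
For comparison, the dedicated-Mutex variant being suggested would look roughly like this (module layout and constant name are illustrative, not the PR's code):

```ruby
require "thread"

module DeadLetterPostOffice
  MUTEX = Mutex.new   # lock owned by the post office, instead of the global Thread.exclusive

  def self.post(event)
    MUTEX.synchronize do
      # only callers of .post contend here; unrelated users of
      # Thread.exclusive elsewhere in the process are unaffected
      @destination << event
    end
  end
end
```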

@jsvd
Member Author

jsvd commented Nov 5, 2015

@jordansissel (and anyone else): I would like some advice on the dead letter destination. My thought is that dumping queues should (for now) be forced to a file on disk in JSON format, which is what this PR does.
That said, either this path is configurable or it isn't, so:
a) if it's configurable, how do we do it? is it another flag that only works if stall detection is enabled? what is the default value?
b) if it's not, what should the path be? a tmpfile? a file inside the logstash installation directory? is it always the same file name or does it vary (e.g. with time)?

My idea is to change this PR's default to write the file as something like: logstash/logs/dead_letter.#{time_when_logstash_started}.dump
This way multiple restarts will write multiple files; otherwise multiple restarts would either overwrite the existing dump or append to it, making it harder to understand which stalled events belonged to each run.
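
A rough sketch of that timestamped-path idea (the directory layout comes from the comment above; the helper itself is hypothetical):

```ruby
# One dump file per Logstash run, so repeated restarts neither overwrite
# nor append to a previous run's stalled events.
STARTED_AT = Time.now.strftime("%Y-%m-%dT%H%M%S").freeze

def dead_letter_path(logstash_home)
  File.join(logstash_home, "logs", "dead_letter.#{STARTED_AT}.dump")
end

# dead_letter_path("/opt/logstash")
# # => "/opt/logstash/logs/dead_letter.2015-11-05T103000.dump"
```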

@andrewvc
Contributor

andrewvc commented Nov 5, 2015

I'd like to say I agree with @colinsurprenant that we shouldn't send in-flight stuff to the DLQ just because of a shutdown. It should only be for poisonous or permanently failing events.

Once persistence is implemented LogStash should be capable of a crash-only shutdown, which tackles that problem in a much cleaner way than having to respool selective events.

WRT the DLQ destination, the one design concern I'd like to see addressed is capping the DLQ size. I'd prefer, given that, to just append to a single file vs. write multiple files for each run. If you regularly restart logstash (say via a cron job), it could be easy to blow past any per-file limit.

I'm not sure how we'd accomplish the file locking here (probably a .dlq-lock file containing the PID?).
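
Purely as an illustration of that idea, a .dlq-lock file could be as simple as the following (not part of this PR; names and behaviour are assumptions):

```ruby
# Record our PID in a lock file; refuse to append to the DLQ if another
# live process already holds the lock. Stale locks from dead processes
# are taken over. A sketch of the idea, not a robust implementation.
LOCK_PATH = ".dlq-lock"

def acquire_dlq_lock
  if File.exist?(LOCK_PATH)
    other_pid = File.read(LOCK_PATH).to_i
    if other_pid > 0
      begin
        Process.kill(0, other_pid)   # signal 0 only checks that the process exists
        raise "DLQ already locked by running pid #{other_pid}"
      rescue Errno::ESRCH
        # the lock holder is gone; fall through and take the lock over
      end
    end
  end
  File.write(LOCK_PATH, Process.pid.to_s)
end
```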

@jsvd
Member Author

jsvd commented Nov 5, 2015

@andrewvc agreed with all points in the long term, when eyeballing the next major version of logstash with persistence.

However, until then, we need to dump the SizedQueues to some place and force the shutdown. In my opinion a lot of this code will probably be deleted once we're able to safely shut down logstash thanks to persistence.
Another thing to note here is that this PR, for now, will only dump the input_to_filter SizedQueue, which will have at most 20 events, so file size should not be much of a concern here.

I may have done the wrong thing here by associating this with dead letters; I'm considering finding another name for it.
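
The dump step described above is essentially draining that bounded queue into the dead-letter destination; schematically (the drain loop itself is an assumption, the names follow the discussion):

```ruby
# input_to_filter is a SizedQueue (capacity 20), so this drains at most
# 20 pending events before the forced exit.
def drain(queue)
  loop do
    event = queue.pop(true)          # non-blocking pop
    DeadLetterPostOffice.post(event)
  end
rescue ThreadError
  # pop(true) raises ThreadError once the queue is empty; we're done
end
```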

@andrewvc
Contributor

andrewvc commented Nov 5, 2015

@jsvd why will only the input_to_filter queue be dumped? Won't we lose in-flight events in filter_to_output (to say nothing of the internal buffer in elasticsearch.rb)?

IMHO if we really want this feature we should do the full thing without caveats. Dropping stuff in the Elasticsearch queue is not acceptable. I think our biggest roadblock right now is the java core rewrite, and if persistence is that important we can tackle it in ruby earlier than later.

I think it'd be faster to just implement a persistent queue along the guidelines I proposed before in ruby (instead of waiting for a java core rewrite), and let it be a configurable setting for now (for those who don't want to eat the perf cost). In fact, I already implemented a rough PoC of that in ~1 day a few months ago with a not bad performance cost (and that was using Elasticsearch as a persistence backend).

@colinsurprenant
Contributor

@andrewvc we already have a persistent queue implementation that is not based on the core-event java rewrite, and we decided instead to move on with the core event java rewrite. If we are to move in that direction, as we seem to agree we are, then as you suggest we should probably focus on that instead of side-tracking onto yet another implementation strategy, no? This sounds to me like going full circle...

@ph ph self-assigned this Nov 16, 2015
@@ -175,7 +179,8 @@ def start_input(plugin)
end

def inputworker(plugin)
- LogStash::Util::set_thread_name("<#{plugin.class.config_name}")
+ LogStash::Util.set_thread_name("<#{plugin.class.config_name}")
+ LogStash::Util.set_thread_plugin(plugin)
Contributor

Love having this around for troubleshooting!

@ph
Contributor

ph commented Nov 17, 2015

After discussing with @jsvd, we agreed to move the ShutdownController initialization outside of the pipeline; the pipeline will be passed to the controller. This will make the code easier to follow and more extensible.
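
Roughly, the shape being agreed on here (everything except the ShutdownController name is an assumption about how the pieces fit together):

```ruby
# The agent constructs the controller and hands it the pipeline, rather
# than the pipeline building its own controller internally.
class ShutdownController
  def initialize(pipeline, cycle_period = 1)   # sampling period is illustrative
    @pipeline = pipeline
    @cycle_period = cycle_period
  end

  def start
    Thread.new do
      loop do
        sleep @cycle_period
        # sample inflight counts / plugin thread states from @pipeline
        # and force an exit if the shutdown looks stalled
      end
    end
  end
end

# in the agent, once SIGTERM/SIGINT is received:
# ShutdownController.new(pipeline).start if unsafe_shutdown?
```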

# * at least REPORT_EVERY reports have been created
# * the inflight event count is monotonically increasing
# * there are worker threads running which aren't blocked on SizedQueue pop/push
# * the stalled thread list is constant in the previous REPORT_EVERY reports
Contributor

+1 for the comment
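
For readers, those four conditions translate into roughly the following check (the report objects and their inflight_count / stalled_threads accessors are illustrative, not the PR's exact implementation):

```ruby
# Each report is assumed to capture, at one sampling instant, the total
# inflight event count and the list of threads busy inside plugin code
# (i.e. not merely blocked on a SizedQueue pop/push).
def stalled?(reports, report_every)
  return false if reports.size < report_every
  recent = reports.last(report_every)

  counts = recent.map(&:inflight_count)
  not_draining = counts.each_cons(2).all? { |a, b| b >= a }   # count never goes down

  stallers          = recent.map(&:stalled_threads)
  has_stallers      = !stallers.last.empty?
  constant_stallers = stallers.uniq.size == 1                 # same threads every report

  not_draining && has_stallers && constant_stallers
end
```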

@ph
Contributor

ph commented Nov 18, 2015

code looks good, will do some manual testing.

@ph
Contributor

ph commented Nov 18, 2015

Maybe add --allow-unsafe-shutdown to the documentation.

Manual tests run successfully.

Stuck pipeline.

~/e/logstash git:pr/4051 ❯❯❯ bin/logstash -e 'input { generator { } } filter { ruby { code => "sleep 10000" } } output { stdout { codec => dots } }' -w 1 --allow-unsafe-shutdown
Default settings used: Filter workers: 1
Logstash startup completed
^CSIGINT received. Shutting down the pipeline. {:level=>:warn}
Received shutdown signal, but pipeline is still waiting for in-flight events
to be processed. Sending another ^C will force quit Logstash, but this may cause
data loss. {:level=>:warn}
 {:level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"input_to_filter"=>20, "total"=>20}, "STALLING_THREADS"=>{["LogStash::Filters::Ruby", {"code"=>"sleep 10000"}]=>[{"thread_id"=>15, "name"=>"|filterworker.0", "current_call"=>"(ruby filter code):1:in `sleep'"}]}}
The shutdown process appears to be stalled due to busy or blocked plugins. Check the logs for more information. {:level=>:error}
 {:level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"input_to_filter"=>20, "total"=>20}, "STALLING_THREADS"=>{["LogStash::Filters::Ruby", {"code"=>"sleep 10000"}]=>[{"thread_id"=>15, "name"=>"|filterworker.0", "current_call"=>"(ruby filter code):1:in `sleep'"}]}}
 {:level=>:warn, "INFLIGHT_EVENT_COUNT"=>{"input_to_filter"=>20, "total"=>20}, "STALLING_THREADS"=>{["LogStash::Filters::Ruby", {"code"=>"sleep 10000"}]=>[{"thread_id"=>15, "name"=>"|filterworker.0", "current_call"=>"(ruby filter code):1:in `sleep'"}]}}
Forcefully quitting logstash.. {:level=>:fatal}

Normal healthy pipeline.

~/e/logstash git:pr/4051 ❯❯❯ bin/logstash -e 'input { generator { } } output { stdout { codec => dots } }' -w 1 --allow-unsafe-shutdown
Default settings used: Filter workers: 1
............................................................................................Logstash startup completed
.................................................................................................................................................................SIGINT received. Shutting down the pipeline. {:level=>:warn}

@ph
Contributor

ph commented Nov 18, 2015

@jsvd I have removed the LGTM, the tests are failing on this branch.
Just fix them and it's LGTM.

@ph ph closed this Nov 18, 2015
@ph ph reopened this Nov 18, 2015
@ph
Contributor

ph commented Nov 18, 2015

failing tests:


  1) LogStash::ShutdownController when unsafe_shutdown is true with a non-stalled pipeline should request more than NUM_REPORTS "inflight_count"
     Failure/Error: LogStash::ShutdownController::REPORTS.clear
     NameError:
       uninitialized constant LogStash::ShutdownController::REPORTS
     # ./spec/core/shutdown_controller_spec.rb:12:in `(root)'
     # ./vendor/bundle/jruby/1.9/gems/rspec-wait-0.0.7/lib/rspec/wait.rb:46:in `(root)'
     # ./lib/bootstrap/rspec.rb:11:in `(root)'

  2) LogStash::ShutdownController when unsafe_shutdown is true with a non-stalled pipeline shouldn't force exit after NUM_REPORTS cycles
     Failure/Error: LogStash::ShutdownController::REPORTS.clear
     NameError:
       uninitialized constant LogStash::ShutdownController::REPORTS
     # ./spec/core/shutdown_controller_spec.rb:12:in `(root)'
     # ./vendor/bundle/jruby/1.9/gems/rspec-wait-0.0.7/lib/rspec/wait.rb:46:in `(root)'
     # ./lib/bootstrap/rspec.rb:11:in `(root)'

  3) LogStash::ShutdownController when unsafe_shutdown is true with a stalled pipeline should force exit after NUM_REPORTS cycles
     Failure/Error: LogStash::ShutdownController::REPORTS.clear
     NameError:
       uninitialized constant LogStash::ShutdownController::REPORTS
     # ./spec/core/shutdown_controller_spec.rb:12:in `(root)'
     # ./vendor/bundle/jruby/1.9/gems/rspec-wait-0.0.7/lib/rspec/wait.rb:46:in `(root)'
     # ./lib/bootstrap/rspec.rb:11:in `(root)'

* start logstash with --allow-unsafe-shutdown to force_exit on stalled shutdown
* by default --allow-unsafe-shutdown is disabled
* stall detection kicks in when SIGTERM/SIGINT is received
* check if inflight event count isn't going down and if there are blocked/blocking plugin threads
@jsvd jsvd force-pushed the abort_on_stalled_shutdown branch from 1f2f8d9 to 9e76fa7 Compare November 19, 2015 15:30
@jsvd
Member Author

jsvd commented Nov 19, 2015

merged to:
2.1: 1412885
2.x: 79b90b1
master: 5db1d28

@jsvd jsvd closed this Nov 19, 2015
@andrewvc andrewvc mentioned this pull request Nov 23, 2015

Successfully merging this pull request may close these issues.

stalled outputs will prevent proper shutdown
6 participants