exception in output thread silently locks up logstash #2130
+1, got the exact same problem.
I agree. We'll make sure it gets fixed! Sorry you've spent so much time on this.
On my "what version" question - I see in the stack trace mention of "logstash.jar", which means you're running Logstash v1.3.3 or older. I don't know if this solves anything for you, but Logstash 1.4.0 added a ton of text encoding fixes. Notably some automatic correction for really-poorly-encoded text, as well as warnings being logged at input time about bad text encoding (prior behavior was to give some confusing error during output). Hope this helps! Of course, if you see this on Logstash 1.4.x or 1.5.0.beta1, definitely let us know more details :)
Hey @jordansissel , I'm using version 1.4.2, having installed Logstash on RedHat (via logstash-1.4.2-1_2c0f5a1.noarch.rpm)
You're right that I was running pre-1.4 -- 1.3.1, I believe. However, when I created this issue, I read the code for 1.4 (and 1.5, I think) and verified that the issue is still present. Check the code; an exception in an output still means that the output thread dies and nothing will read from the SizedQueue.
The relevant section from trunk: https://github.com/elasticsearch/logstash/blob/master/lib/logstash/pipeline.rb#L225
Input crashes result in a retry: https://github.com/elasticsearch/logstash/blob/master/lib/logstash/pipeline.rb#L187
A crash in one filterworker thread causes all filters to be torn down (?!): https://github.com/elasticsearch/logstash/blob/master/lib/logstash/pipeline.rb#L222
That last may warrant its own bug report.
@lexelby could you provide a sample config, or at least a subset of your config, for us to recreate the issue? Depending on the behavior of the plugin used, experience may vary per use case. Which inputs, filters, and outputs are you using that cause this behavior?
This bug has absolutely nothing to do with the specific output plugin used. About the only relevant attribute of the plugin is whether it's multithreaded, and I'm not even sure that matters.

To reproduce this bug, write a very simple output plugin that prints messages to stdout. When a certain keyword is present in the message, it should instead raise an exception and not catch it. Then set up a pipeline with just a stdin input and that output, with no filters. Send a few messages that don't trigger the exception just to make sure it works, then send one that does. I believe you'll see a traceback at this point, but Logstash won't exit. When you enter more messages, they won't be printed out. After 20 messages, the internal SizedQueue will fill and Logstash will stop receiving messages.

If you needed to debug this, you could create a filter that also prints messages to stdout, so that you can observe that the entire pipeline has stalled.
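For illustration, a minimal sketch of such a throwaway reproduction plugin, assuming the Logstash 1.x output plugin API; the plugin name, file path, and trigger keyword are hypothetical:

require "logstash/outputs/base"
require "logstash/namespace"

# Hypothetical plugin, e.g. saved under <pluginpath>/logstash/outputs/crashy.rb
# and loaded with --pluginpath.
class LogStash::Outputs::Crashy < LogStash::Outputs::Base
  config_name "crashy"
  milestone 1

  def register
    # nothing to set up
  end

  def receive(event)
    # Simulate an output bug: raise an uncaught exception on a trigger keyword.
    raise "simulated output failure" if event["message"].to_s.include?("boom")
    puts event["message"]
  end
end

With a config of just stdin input and this output, typing a line containing "boom" should print a traceback; lines typed afterwards are then silently swallowed until the internal queue fills and input stops being consumed.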
@jordansissel @talevy turning this issue into a code question: would it be meaningful to apply similar retry logic in pipeline#outputworker as is done in pipeline#inputworker? For sure not an optimal solution, but far easier to detect from the logs.
We clearly need more predictable behaviour in the presence of exceptions across all the plugin types (input, output, filter, codec).
However, each plugin can have its own begin/rescue/retry blocks and thus prevent exceptions from reaching the core Logstash code. I think the correct behaviour we want in the plugin workers (core pipeline.rb code) is either to crash immediately or to retry on failure up to a certain number of times per unit of time (e.g. 3 exceptions per 2 minutes). Retrying indefinitely in filters and outputs will result in data loss because we're pulling data from inputs in the meantime. Opinions?
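A rough sketch of what bounded retry in the output worker could look like, using the thresholds suggested above; the names loosely mirror pipeline.rb but this is not the actual implementation:

# Hypothetical sketch only -- not real Logstash code.
def outputworker
  failure_times = []                               # timestamps of recent exceptions
  while true
    event = @output_queue.pop
    break if event == LogStash::SHUTDOWN
    begin
      output(event)
    rescue => e
      @logger.error("Exception in output worker", :exception => e, :backtrace => e.backtrace)
      now = Time.now
      failure_times << now
      failure_times.reject! { |t| now - t > 120 }  # keep a 2-minute window
      raise if failure_times.size > 3              # too many recent failures: crash loudly
      retry                                        # otherwise retry the same event
    end
  end
end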
Could we please remove the "needs more info" tag on this issue? I've explained why I don't think it needs more info, and I've explained the bug in a great deal of detail. I could submit a pull request if you feel that would be necessary, but I'd be much more interested in hearing the devs' thoughts on the best solution here.
Thanks!!
This issue is currently killing us also.
Is there any update on this? Is it planned to be addressed in 1.5?
I had this problem as well. My solution was just to use the i18n filter on everything first. It removes all special characters, so the filters that follow have no issues with bogus data.
A similar issue can occur, for example, when failing to configure the cluster name in the elasticsearch output plugin.
It took me a while to figure out what was going on. It'd be nice if the ES output plugin could at least log an event when it is unable to connect to the configured cluster name. In my case, the plugin was just trying the default
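For reference, a config roughly like the following (host and cluster names are made up) can trigger that silent failure mode with the 1.x node/transport protocols if the cluster setting does not match cluster.name in elasticsearch.yml:

output {
  elasticsearch {
    host => "es.example.org"
    cluster => "my-cluster"   # must match cluster.name on the ES side;
                              # otherwise the node never joins and events back up
  }
}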
Hello, was there ever a fix for this exception? We recently updated to 2.3.4 (July 7, 2016) from 1:2.3.3-1, which was experiencing exception issues. Please update, and thanks.
Going to close this as it seems to be:
If you feel this is in error, please open a new ticket in the repository for the specific plugin.
@untergeek: I find this very frustrating. I previously explained that this is not a per-plugin issue, but instead a design flaw in the way that outputs are run. See my comment above for more detail. I did a quick code dive and it appears that the problem still exists, so this issue is not outdated. Check out
Persistent queues don't help. Sure, they make it so that you don't actually lose in-memory messages if all of your worker threads die, but you still end up with Logstash stalled, silently sitting there not processing any messages. PQs have the same back-pressure mechanism (https://github.com/elastic/logstash/blob/2c1736531561dd64d51e923b635618c8b7ee904f/docs/static/persistent-queues.asciidoc#handling-back-pressure) that stalls the pipeline if no worker threads are running.

The solution is still simple: catch exceptions from outputs, log them, and allow the worker thread to continue. An alternative would be to detect that all worker threads have died and cause Logstash to exit, so that monitoring can catch the issue or a supervisor process can restart it. The latter is less optimal because it means that Logstash slowly degrades in event throughput as it has fewer and fewer living worker threads.

In summary: I recommend you reopen this. Since I no longer run Logstash, I don't have a dog in this fight anymore other than a) my previously invested time, and b) trying to make a great open source project greater. So I'm going to drop this unless asked directly for clarification.
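As an aside, a rough sketch of that second option (exit when every worker thread has died); the instance variables and the supervisor setup are assumptions, not the actual pipeline internals:

# Hypothetical watchdog: if all pipeline worker threads have died, log and exit
# so a supervisor (systemd, runit, monit) can restart the process instead of
# leaving it stalled and silent.
Thread.new do
  loop do
    sleep 10
    if @worker_threads.none?(&:alive?)
      @logger.fatal("All pipeline worker threads have died; exiting")
      exit(1)
    end
  end
end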
@lexelby I totally agree with you on this, but the good news is we still don't need to reopen this, since the issue was fixed a long time ago. Logstash will now fully crash when an output raises an unexpected error. This is the correct behavior IMHO, since outputs should never raise exceptions to the pipeline (that is considered a bug), and there's no one 'right' error handling strategy the pipeline could apply on their behalf. The only safe thing to do is crash.
AWESOME! I agree, that's a great strategy. It surfaces the problem to the operator so that they can report the bug upstream. Thanks for clarifying!
Thanks for following up on this @lexelby :)
I'm copying this from the logstash JIRA because I've run into it yet again. This is a pretty serious problem and I've wasted a lot of time debugging it.
If an output plugin encounters an exception (such as LOGSTASH-1504 et al.), the output crashes silently. The exception is printed to stdout and never reaches the log. This causes that output's SizedQueue to no longer be consumed, and eventually filter workers and then input workers block on SizedQueue pushes. Logstash appears to cease processing messages without warning or explanation.
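A standalone Ruby sketch (plain Ruby, not Logstash code) of the underlying failure mode, a dead consumer silently stalling the producers of a SizedQueue; the queue size of 20 matches the fixed-size internal queue described above:

require "thread"

queue = SizedQueue.new(20)

consumer = Thread.new do
  loop do
    msg = queue.pop
    raise "boom" if msg == "bad"    # consumer dies here, like a crashed output thread
    puts "processed #{msg}"
  end
end
consumer.abort_on_exception = false # the exception stays inside the thread (the default)

queue.push("bad")                        # kills the consumer
25.times { |i| queue.push("msg #{i}") }  # blocks forever once 20 items pile up
puts "never reached: the producer is stuck on push"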
The only way I was able to debug this was to run Logstash in a screen session and watch the output. I ran it with -vv because I had no idea why it was locking up, and I set my scrollback to 100000 lines. At that point I was able to see my output thread (elasticsearch_river) crash, although I had to scroll quite a way back to find the exception, since the other threads continued on for a while.
Here's the exception from my output thread:
org.jruby.exceptions.RaiseException: (InvalidByteSequenceError) ""\xB9"" on ASCII-8BIT
at org.jruby.RubyString.encode(org/jruby/RubyString.java:7599)
at json.ext.GeneratorMethods$RbHash.to_json(json/ext/GeneratorMethods.java:71)
at LogStash::Event.to_json(file:/opt/logstash/logstash.jar!/logstash/event.rb:157)
at LogStash::Outputs::ElasticSearchRiver.receive(file:/opt/logstash/logstash.jar!/logstash/outputs/elasticsearch_river.rb:212)
at LogStash::Outputs::Base.handle(file:/opt/logstash/logstash.jar!/logstash/outputs/base.rb:86)
at RUBY.initialize((eval):247)Date filter looking for field {:type=>"syslog", :field=>"timestamp", :level=>:debug, :file=>"/opt/logstash/logstash.jar!/logstash/filters/date.rb", :line=>"197"}
at org.jruby.RubyProc.call(org/jruby/RubyProc.java:271)
at LogStash::Pipeline.output(file:/opt/logstash/logstash.jar!/logstash/pipeline.rb:259)
at RUBY.outputworker(file:/opt/logstash/logstash.jar!/logstash/pipeline.rb:218)
at RUBY.start_outputs(file:/opt/logstash/logstash.jar!/logstash/pipeline.rb:145)
It's irrelevant why my output thread is crashing, but for those curious, our vulnerability scanner sent garbage bytes to the syslog port and logstash happily ate them, until it tried to convert the message to UTF-8 in the JSON codec.
I would set the priority of this to critical if I could. I almost dropped logstash because this issue made it inexplicably unreliable. At least logging the exception would have given me a clue as to how to proceed. A better solution would be to drop/log the offending message and restart the output.
Exceptions in inputs are handled by logging the exception and restarting the input worker. Exceptions in filters are handled by logging the exception and shutting down the filter worker. At least in either of these cases, the sysadmin has some clue about what went wrong.