
exception in output thread silently locks up logstash #2130

Closed
lexelby opened this issue Nov 25, 2014 · 24 comments

Comments

@lexelby

lexelby commented Nov 25, 2014

I'm copying this from the logstash JIRA because I've run into it yet again. This is a pretty serious problem and I've wasted a lot of time debugging it.

If an output plugin encounters an exception (such as LOGSTASH-1504 et al.), the output crashes silently. The exception is printed to stdout and never reaches the log. That output's SizedQueue is then no longer consumed, and eventually the filter workers and then the input workers block on SizedQueue pushes. Logstash appears to cease processing messages without warning or explanation.
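To make the failure mode concrete, here's a minimal plain-Ruby sketch (illustrative only, not Logstash's actual pipeline code) of how a dead consumer thread turns a bounded SizedQueue into a silent stall:

require 'thread'

queue = SizedQueue.new(20)            # bounded, like logstash's internal output queue

consumer = Thread.new do
  loop do
    event = queue.pop
    raise "simulated output exception" if event == "bad"   # thread dies silently (abort_on_exception is false)
    puts event
  end
end

# Once the consumer is dead, this push loop blocks forever as soon as the
# queue holds 20 unconsumed events -- the whole pipeline "locks up".
30.times { |i| queue.push(i == 5 ? "bad" : "event #{i}") }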

The only way I was able to debug this was to run logstash in a screen and watch the output. I ran it with -vv because I had no idea why it was locking up, and I set my scrollback to 100000 lines. At this point I was able to see my output thread (elasticsearch_river) crash, although I had to scroll a ways back to find this exception, since the other threads continued on for a while.
Here's the exception from my output thread:
org.jruby.exceptions.RaiseException: (InvalidByteSequenceError) ""\xB9"" on ASCII-8BIT
at org.jruby.RubyString.encode(org/jruby/RubyString.java:7599)
at json.ext.GeneratorMethods$RbHash.to_json(json/ext/GeneratorMethods.java:71)
at LogStash::Event.to_json(file:/opt/logstash/logstash.jar!/logstash/event.rb:157)
at LogStash::Outputs::ElasticSearchRiver.receive(file:/opt/logstash/logstash.jar!/logstash/outputs/elasticsearch_river.rb:212)
at LogStash::Outputs::Base.handle(file:/opt/logstash/logstash.jar!/logstash/outputs/base.rb:86)
at RUBY.initialize((eval):247)Date filter looking for field {:type=>"syslog", :field=>"timestamp", :level=>:debug, :file=>"/opt/logstash/logstash.jar!/logstash/filters/date.rb", :line=>"197"}
at org.jruby.RubyProc.call(org/jruby/RubyProc.java:271)
at LogStash::Pipeline.output(file:/opt/logstash/logstash.jar!/logstash/pipeline.rb:259)
at RUBY.outputworker(file:/opt/logstash/logstash.jar!/logstash/pipeline.rb:218)
at RUBY.start_outputs(file:/opt/logstash/logstash.jar!/logstash/pipeline.rb:145)

It's irrelevant why my output thread is crashing, but for those curious, our vulnerability scanner sent garbage bytes to the syslog port and logstash happily ate them, until it tried to convert the message to UTF-8 in the JSON codec.

I would set the priority of this to critical if I could. I almost dropped logstash because this issue made it inexplicably unreliable. At least logging the exception would have given me a clue as to how to proceed. A better solution would be to drop/log the offending message and restart the output.

Exceptions in inputs are handled by logging the exception and restarting the input worker. Exceptions in filters are handled by logging the exception and shutting down the filter worker. At least in either of these cases, the sysadmin has some clue about what went wrong.

@bscruzpt

bscruzpt commented Dec 5, 2014

+1, got the exact same problem.

@suyograo suyograo added this to the v1.5.0 milestone Dec 18, 2014
@jordansissel
Contributor

@lexelby @bscruzpt what version of logstash?

@jordansissel
Contributor

"This is a pretty serious problem and I've wasted a lot of time debugging it."

I agree. We'll make sure it gets fixed! Sorry you've spent so much time on this.

@jordansissel
Contributor

On my "what version" question: I see mention of "logstash.jar" in the stack trace, which means you're running Logstash v1.3.3 or older.

I don't know if this solves anything for you, but Logstash 1.4.0 added a ton of text encoding fixes. Notably some automatic correction for really-poorly-encoded text, as well as warnings being logged at input time about bad text encoding (prior behavior was to give some confusing error during output).
Relevant v1.4.x changelog entry https://github.com/elasticsearch/logstash/blob/v1.4.2/CHANGELOG#L68-L70

Hope this helps! Of course, if you see this on Logstash 1.4.x or 1.5.0.beta1, definitely let us know more details :)

@bscruzpt

bscruzpt commented Jan 5, 2015

Hey @jordansissel, I'm using version 1.4.2, installed on RedHat (via logstash-1.4.2-1_2c0f5a1.noarch.rpm)

@lexelby
Author

lexelby commented Jan 5, 2015

You're right that I was running pre-1.4 -- 1.3.1, I believe. However, when I created this issue, I read the code for 1.4 (and 1.5, I think) and verified that the issue is still present. Check the code; an exception in an output still means that the output thread dies and nothing will read from the SizedQueue.

@lexelby
Author

lexelby commented Jan 5, 2015

The relevant section from trunk:

https://github.com/elasticsearch/logstash/blob/master/lib/logstash/pipeline.rb#L225

Input crashes result in a retry:

https://github.com/elasticsearch/logstash/blob/master/lib/logstash/pipeline.rb#L187

A crash in one filterworker thread causes all filters to be torn down (?!):

https://github.com/elasticsearch/logstash/blob/master/lib/logstash/pipeline.rb#L222

That last may warrant its own bug report.

@talevy
Contributor

talevy commented Jan 14, 2015

@lexelby could you provide a sample config, or at least a subset of your config, so we can recreate the issue? Behavior may vary depending on which plugins are used. Which inputs, filters, and outputs are you using that cause this behavior?

@lexelby
Author

lexelby commented Jan 14, 2015

This bug has absolutely nothing to do with the specific output plugin used. About the only relevant attribute of the plugin is whether it's multithreaded, and I'm not even sure that matters. To reproduce this bug, write a very simple output plugin that prints messages to stdout; when a certain keyword is present in the message, it should raise an exception and not catch it.

Then set up a pipeline with just a stdin input and that output, with no filters. Send a few messages that don't trigger the exception just to make sure it works, then send one that does. I believe you'll see a traceback at this point, but Logstash won't exit. Any further messages you enter won't be printed. After 20 messages, the internal SizedQueue will fill and logstash will stop receiving messages. If you needed to debug this, you could add a filter that also prints messages to stdout so you could observe that the entire pipeline has stalled.
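For reference, a rough sketch of such a throwaway output (the "crashy" name and the trigger word are made up; the class layout follows the 1.x-era LogStash::Outputs::Base plugin API):

require "logstash/outputs/base"
require "logstash/namespace"

# Hypothetical plugin used only to reproduce the lock-up; not a real logstash plugin.
class LogStash::Outputs::Crashy < LogStash::Outputs::Base
  config_name "crashy"
  milestone 1

  def register
    # nothing to set up
  end

  def receive(event)
    # deliberately let the exception escape to the pipeline's output worker
    raise "simulated output failure" if event["message"].to_s.include?("boom")
    puts event["message"]
  end
end

Run with a config of just a stdin input and this output: after a line containing the trigger word, nothing further is printed, and once the internal queue fills, stdin stops being consumed as well.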

@wiibaa
Contributor

wiibaa commented Jan 14, 2015

@jordansissel @talevy turning this issue into a code question:

Would it be meaningful to apply retry logic in pipeline#outputworker similar to what is done in pipeline#inputworker? It's certainly not an optimal solution, but it would make failures far easier to spot in the logs.
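Something along these lines (a sketch of the idea only, not a tested patch; the @output_queue and output names are stand-ins for whatever pipeline.rb actually uses):

def outputworker
  begin
    while true
      event = @output_queue.pop              # the SizedQueue that filter workers push into
      break if event == LogStash::SHUTDOWN
      output(event)
    end
  rescue => e                                 # mirror inputworker: log and keep consuming
    @logger.error("Exception in output thread, restarting output worker",
                  :exception => e, :backtrace => e.backtrace)
    retry
  end
end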

@suyograo suyograo assigned jsvd and unassigned talevy Jan 19, 2015
@jsvd
Member

jsvd commented Jan 21, 2015

We clearly need more predictable behaviour in the presence of exceptions across all the plugin types (input, output, filter, codec).

  • inputs retry constantly after exceptions: not sure this is desirable, as it can lead to stuck instances
  • the remaining plugin types exit when an exception occurs

However, each plugin can have its own begin/rescue/retry blocks and thus prevent exceptions from reaching the core logstash code.

I think the correct behaviour we want in the plugin workers (core pipeline.rb code) is to either crash immediately or retry on failure up to a certain number of times per unit of time (e.g. 3 exceptions per 2 minutes). Retrying indefinitely in filters and outputs will result in data loss, because we keep pulling data from inputs in the meantime. Opinions?
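As a rough illustration of that bounded-retry policy (names and thresholds are placeholders, not proposed code):

MAX_FAILURES = 3     # allowed exceptions...
WINDOW = 120         # ...within this many seconds

failures = []
begin
  run_worker                                   # placeholder for the plugin worker loop
rescue => e
  now = Time.now
  failures << now
  failures.reject! { |t| now - t > WINDOW }    # only count recent failures
  if failures.size > MAX_FAILURES
    logger.error("Too many worker failures, giving up", :exception => e)
    raise                                      # crash so the operator notices
  end
  logger.warn("Worker failed, retrying", :exception => e)
  retry
end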

@lexelby
Author

lexelby commented Feb 10, 2015

Could we please remove the "needs more info" tag on this issue? I've explained why I don't think it needs more info, and I've described the bug in a great deal of detail. I could submit a pull request if you feel that would be necessary, but I'd be much more interested in hearing the devs' thoughts on the best solution here.

@jsvd
Member

jsvd commented Feb 10, 2015

@lexelby removed. I wrote more about this issue here and also created a proposal here

@lexelby
Author

lexelby commented Feb 10, 2015

Thanks!!

@blysik

blysik commented Feb 16, 2015

This issue is currently killing us also.

@suyograo suyograo modified the milestones: 1.5.1, v1.5.0 Feb 26, 2015
@jsvd jsvd removed the v1.5.0 label Feb 26, 2015
@jsvd jsvd modified the milestones: v1.6.0, v1.5.1 Feb 26, 2015
@ScottChapman

Is there any update on this? Is it planned to be addressed in 1.5?

@suyograo suyograo added v1.6.0 and removed confirmed labels May 15, 2015
@suyograo suyograo removed this from the v1.5.1 milestone May 15, 2015
@elvarb

elvarb commented May 23, 2015

I had this problem as well. My solution was just to run everything through the i18n filter first. It removes all special characters, so the filters that follow have no issues with bogus data.

@inqueue
Member

inqueue commented Oct 20, 2015

A similar issue can occur, for example, when failing to configure the cluster name in the elasticsearch output plugin.

bin/logstash -f /etc/logstash/conf.d/1-zimbra-postfix.conf
Oct 19, 2015 7:57:22 PM org.elasticsearch.node.internal.InternalNode <init>
INFO: [logstash-zmb1.zmbdom.tld-28028-13458] version[1.7.0], pid[28028], build[929b973/2015-07-16T14:31:07Z]
Oct 19, 2015 7:57:22 PM org.elasticsearch.node.internal.InternalNode <init>
INFO: [logstash-zmb1.zmbdom.tld-28028-13458] initializing ...
Oct 19, 2015 7:57:22 PM org.elasticsearch.plugins.PluginsService <init>
INFO: [logstash-zmb1.zmbdom.tld-28028-13458] loaded [], sites []
Oct 19, 2015 7:57:23 PM org.elasticsearch.bootstrap.Natives <clinit>
WARNING: JNA not found. native methods will be disabled.
Oct 19, 2015 7:57:24 PM org.elasticsearch.node.internal.InternalNode <init>
INFO: [logstash-zmb1.zmbdom.tld-28028-13458] initialized
Oct 19, 2015 7:57:24 PM org.elasticsearch.node.internal.InternalNode start
INFO: [logstash-zmb1.zmbdom.tld-28028-13458] starting ...
Oct 19, 2015 7:57:24 PM org.elasticsearch.transport.TransportService doStart
INFO: [logstash-zmb1.zmbdom.tld-28028-13458] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/192.168.5.16:9301]}
Oct 19, 2015 7:57:24 PM org.elasticsearch.discovery.DiscoveryService doStart
INFO: [logstash-zmb1.zmbdom.tld-28028-13458] elasticsearch/ETb8ivnuSeGgOrceTw2O8Q
Oct 19, 2015 7:57:54 PM org.elasticsearch.discovery.DiscoveryService waitForInitialState
WARNING: [logstash-zmb1.zmbdom.tld-28028-13458] waited for 30s and no initial state was set by the discovery
Oct 19, 2015 7:57:54 PM org.elasticsearch.node.internal.InternalNode start
INFO: [logstash-zmb1.zmbdom.tld-28028-13458] started
Failed to install template: waited for [30s] {:level=>:error}
Logstash startup completed
...events truncated...
Got error to send bulk of actions: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master]; {:level=>:error}
Failed to flush outgoing items {:outgoing_count=>10, :exception=>"Java::OrgElasticsearchClusterBlock::ClusterBlockException", :backtrace=>["org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(org/elasticsearch/cluster/block/ClusterBlocks.java:151)", "org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(org/elasticsearch/cluster/block/ClusterBlocks.java:141)", "org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(org/elasticsearch/action/bulk/TransportBulkAction.java:215)", "org.elasticsearch.action.bulk.TransportBulkAction.access$000(org/elasticsearch/action/bulk/TransportBulkAction.java:67)", "org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(org/elasticsearch/action/bulk/TransportBulkAction.java:153)", "org.elasticsearch.action.support.TransportAction$ThreadedActionListener$2.run(org/elasticsearch/action/support/TransportAction.java:137)", "java.util.concurrent.ThreadPoolExecutor.runWorker(java/util/concurrent/ThreadPoolExecutor.java:1142)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(java/util/concurrent/ThreadPoolExecutor.java:617)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}
...repeats

It took me a while to figure out what was going on. It'd be nice if the ES output plugin could at least log an event when it is unable to connect to the configured cluster name. In my case, the plugin was just trying the default cluster name, elasticsearch.

@hokie1999

Hello, was there ever a fix for this exception? We recently updated to 2.3.4 (July 7, 2016) from 1:2.3.3-1, which was experiencing exception issues. Could you provide an update? Thanks.

@untergeek
Member

Going to close this as it seems to be:

  1. very outdated
  2. largely a per-plugin issue, rather than a core-specific one
  3. less of an issue now that the persistent queue (PQ) exists, along with many recent fixes

If you feel this is in error, please open a new ticket in the repository for the specific plugin.

@lexelby
Author

lexelby commented Jan 8, 2017

@untergeek: I find this very frustrating.

I previously explained that this is not a per-plugin issue, but instead a design flaw in the way that outputs are run. See my comment above for more detail.

I did a quick code dive and it appears that the problem still exists, so this issue is not outdated. Check out output_batch() in pipeline.rb. The link is to the (current) trunk version.

output_func(event) gets called without being wrapped in a begin...rescue block. If any output allows an exception to reach the top level, that worker thread dies. I don't see any code to restart worker threads.

Persistent queues don't help. I mean, sure, they make it so that you don't actually lose in-memory messages if all of your worker threads die. But you still end up with logstash stalled, just silently sitting there not processing any messages. PQs have the same back-pressure mechanism (https://github.com/elastic/logstash/blob/2c1736531561dd64d51e923b635618c8b7ee904f/docs/static/persistent-queues.asciidoc#handling-back-pressure) that stalls the pipeline if no worker threads are running.

The solution is still simple: catch exceptions from outputs, log them, and allow the worker thread to continue on. An alternative would be to detect that all worker threads have died and cause Logstash to exit, so that monitoring can catch the issue or a supervisor process can restart it. The latter is less optimal because it means Logstash's event throughput slowly degrades over time as it has fewer and fewer living worker threads.
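Concretely, the first option amounts to something like this around the per-event output call (a sketch only; output_func and the logger call are stand-ins for whatever the current pipeline code uses):

events.each do |event|
  begin
    output_func(event)
  rescue => e
    # log and drop the offending event; the worker thread stays alive
    # and keeps draining the queue instead of silently dying
    @logger.error("Output raised an exception, dropping event",
                  :exception => e, :backtrace => e.backtrace)
  end
end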

In summary: I recommend you reopen this. Since I no longer run logstash, I don't have a dog in this fight anymore other than a) my previously invested time, and b) trying to make a great open source project greater. So I'm going to drop this unless asked directly for clarification.

@andrewvc
Contributor

andrewvc commented Jan 9, 2017

@lexelby I totally agree with you on this, but the good news is we don't need to reopen this, since the issue was fixed a long time ago. Logstash will now fully crash when an output raises an unexpected error.

This is the correct behavior IMHO, since outputs should never raise exceptions to the pipeline. Raising one is treated as a bug, and there's only one 'right' error handling strategy in that case: the only safe thing to do is crash.

@lexelby
Author

lexelby commented Jan 9, 2017

AWESOME! I agree, that's a great strategy. It surfaces the problem to the operator so that they can report the bug upstream. Thanks for clarifying!

@andrewvc
Contributor

andrewvc commented Jan 9, 2017

Thanks for following up on this @lexelby :)
