
exception in output thread silently locks up logstash #2130

Closed
lexelby opened this issue Nov 25, 2014 · 24 comments

Comments

@lexelby

lexelby commented Nov 25, 2014

I'm copying this from the logstash JIRA because I've run into it yet again. This is a pretty serious problem and I've wasted a lot of time debugging it.

If an output plugin encounters an exception (such as LOGSTASH-1504 et al.), the output crashes silently. The exception is printed to stdout and never reaches the log. That output's SizedQueue is then no longer consumed, and eventually the filter workers and then the input workers block on SizedQueue pushes. Logstash appears to cease processing messages without warning or explanation.
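To make the failure mode concrete, here's a minimal plain-Ruby sketch (illustrative only, not Logstash's actual pipeline code) of how a dead consumer thread turns a bounded SizedQueue into a silent stall:

require 'thread'

queue = SizedQueue.new(20)            # bounded, like logstash's internal output queue

consumer = Thread.new do
  loop do
    event = queue.pop
    raise "simulated output exception" if event == "bad"   # thread dies silently (abort_on_exception is false)
    puts event
  end
end

# Once the consumer is dead, this push loop blocks forever as soon as the
# queue holds 20 unconsumed events -- the whole pipeline "locks up".
30.times { |i| queue.push(i == 5 ? "bad" : "event #{i}") }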

The only way I was able to debug this was to run logstash in a screen and watch the output. I ran it with -vv because I had no idea why it was locking up, and I set my scrollback to 100000 lines. At this point I was able to see my output thread (elasticsearch_river) crash, although I had to scroll a ways back to find this exception, since the other threads continued on for a while.
Here's the exception from my output thread:
org.jruby.exceptions.RaiseException: (InvalidByteSequenceError) ""\xB9"" on ASCII-8BIT
at org.jruby.RubyString.encode(org/jruby/RubyString.java:7599)
at json.ext.GeneratorMethods$RbHash.to_json(json/ext/GeneratorMethods.java:71)
at LogStash::Event.to_json(file:/opt/logstash/logstash.jar!/logstash/event.rb:157)
at LogStash::Outputs::ElasticSearchRiver.receive(file:/opt/logstash/logstash.jar!/logstash/outputs/elasticsearch_river.rb:212)
at LogStash::Outputs::Base.handle(file:/opt/logstash/logstash.jar!/logstash/outputs/base.rb:86)
at RUBY.initialize((eval):247)Date filter looking for field {:type=>"syslog", :field=>"timestamp", :level=>:debug, :file=>"/opt/logstash/logstash.jar!/logstash/filters/date.rb", :line=>"197"}
at org.jruby.RubyProc.call(org/jruby/RubyProc.java:271)
at LogStash::Pipeline.output(file:/opt/logstash/logstash.jar!/logstash/pipeline.rb:259)
at RUBY.outputworker(file:/opt/logstash/logstash.jar!/logstash/pipeline.rb:218)
at RUBY.start_outputs(file:/opt/logstash/logstash.jar!/logstash/pipeline.rb:145)

It's irrelevant why my output thread is crashing, but for those curious, our vulnerability scanner sent garbage bytes to the syslog port and logstash happily ate them, until it tried to convert the message to UTF-8 in the JSON codec.

I would set the priority of this to critical if I could. I almost dropped logstash because this issue made it inexplicably unreliable. At least logging the exception would have given me a clue as to how to proceed. A better solution would be to drop/log the offending message and restart the output.

Exceptions in inputs are handled by logging the exception and restarting the input worker. Exceptions in filters are handled by logging the exception and shutting down the filter worker. At least in either of these cases, the sysadmin has some clue about what went wrong.

@bscruzpt

bscruzpt commented Dec 5, 2014

+1, got the exact same problem.

@suyograo suyograo added this to the v1.5.0 milestone Dec 18, 2014
@jordansissel
Contributor

@lexelby @bscruzpt what version of logstash?

@jordansissel
Contributor

"This is a pretty serious problem and I've wasted a lot of time debugging it."

I agree. We'll make sure it gets fixed! Sorry you've spent so much time on this.

@jordansissel
Contributor

On my "what version" question: I see mention of "logstash.jar" in the stack trace, which means you're running Logstash v1.3.3 or older.

I don't know if this solves anything for you, but Logstash 1.4.0 added a ton of text encoding fixes. Notably some automatic correction for really-poorly-encoded text, as well as warnings being logged at input time about bad text encoding (prior behavior was to give some confusing error during output).
Relevant v1.4.x changelog entry https://github.com/elasticsearch/logstash/blob/v1.4.2/CHANGELOG#L68-L70

Hope this helps! Of course, if you see this on Logstash 1.4.x or 1.5.0.beta1, definitely let us know more details :)

@bscruzpt

bscruzpt commented Jan 5, 2015

Hey @jordansissel, I'm using version 1.4.2, installed on RedHat (via logstash-1.4.2-1_2c0f5a1.noarch.rpm)

@lexelby
Author

lexelby commented Jan 5, 2015

You're right that I was running pre-1.4 -- 1.3.1, I believe. However, when I created this issue, I read the code for 1.4 (and 1.5, I think) and verified that the issue is still present. Check the code; an exception in an output still means that the output thread dies and nothing will read from the SizedQueue.

@lexelby
Author

lexelby commented Jan 5, 2015

The relevant section from trunk:

https://github.com/elasticsearch/logstash/blob/master/lib/logstash/pipeline.rb#L225

Input crashes result in a retry:

https://github.com/elasticsearch/logstash/blob/master/lib/logstash/pipeline.rb#L187

A crash in one filterworker thread causes all filters to be torn down (?!):

https://github.com/elasticsearch/logstash/blob/master/lib/logstash/pipeline.rb#L222

That last may warrant its own bug report.

@talevy
Contributor

talevy commented Jan 14, 2015

@lexelby could you provide a sample config, or at least a subset of your config, so we can recreate the issue? Behavior may vary depending on which plugins are used. Which inputs, filters, and outputs are you using that cause this behavior?

@lexelby
Author

lexelby commented Jan 14, 2015

This bug has absolutely nothing to do with the specific output plugin used. About the only relevant attribute of the plugin is whether it's multithreaded, and I'm not even sure that matters. To reproduce this bug, write a very simple output plugin that prints messages to stdout; when a certain keyword is present in the message, it should raise an exception and not catch it.

Then set up a pipeline with just a stdin input and that output, with no filters. Send a few messages that don't trigger the exception just to make sure it works, then send one that does. I believe you'll see a traceback at this point, but Logstash won't exit. Any further messages you enter won't be printed. After 20 messages, the internal SizedQueue will fill and logstash will stop receiving messages. If you needed to debug this, you could add a filter that also prints messages to stdout so you could observe that the entire pipeline has stalled.
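For reference, a rough sketch of such a throwaway output (the "crashy" name and the trigger word are made up; the class layout follows the 1.x-era LogStash::Outputs::Base plugin API):

require "logstash/outputs/base"
require "logstash/namespace"

# Hypothetical plugin used only to reproduce the lock-up; not a real logstash plugin.
class LogStash::Outputs::Crashy < LogStash::Outputs::Base
  config_name "crashy"
  milestone 1

  def register
    # nothing to set up
  end

  def receive(event)
    # deliberately let the exception escape to the pipeline's output worker
    raise "simulated output failure" if event["message"].to_s.include?("boom")
    puts event["message"]
  end
end

Run with a config of just a stdin input and this output: after a line containing the trigger word, nothing further is printed, and once the internal queue fills, stdin stops being consumed as well.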

@wiibaa
Contributor

wiibaa commented Jan 14, 2015

@jordansissel @talevy turning this issue into a code question:

Would it be meaningful to apply retry logic in pipeline#outputworker similar to what is done in pipeline#inputworker? It's certainly not an optimal solution, but it would make failures far easier to spot in the logs.
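Something along these lines (a sketch of the idea only, not a tested patch; the @output_queue and output names are stand-ins for whatever pipeline.rb actually uses):

def outputworker
  begin
    while true
      event = @output_queue.pop              # the SizedQueue that filter workers push into
      break if event == LogStash::SHUTDOWN
      output(event)
    end
  rescue => e                                 # mirror inputworker: log and keep consuming
    @logger.error("Exception in output thread, restarting output worker",
                  :exception => e, :backtrace => e.backtrace)
    retry
  end
end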

@suyograo suyograo assigned jsvd and unassigned talevy Jan 19, 2015
@jsvd
Member

jsvd commented Jan 21, 2015

We clearly need more predictable behaviour in the presence of exceptions across all the plugin types (input, output, filter, codec).

  • inputs retry constantly after exceptions: not sure this is desirable, as it can lead to stuck instances
  • the remaining plugin types exit when an exception occurs

However, each plugin can have its own begin/rescue/retry blocks and thus prevent exceptions from reaching the core logstash code.

I think the correct behaviour we want in the plugin workers (core pipeline.rb code) is to either crash immediately or retry on failure up to a certain number of times per unit of time (e.g. 3 exceptions per 2 minutes). Retrying indefinitely in filters and outputs will result in data loss, because we keep pulling data from inputs in the meantime. Opinions?
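As a rough illustration of that bounded-retry policy (names and thresholds are placeholders, not proposed code):

MAX_FAILURES = 3     # allowed exceptions...
WINDOW = 120         # ...within this many seconds

failures = []
begin
  run_worker                                   # placeholder for the plugin worker loop
rescue => e
  now = Time.now
  failures << now
  failures.reject! { |t| now - t > WINDOW }    # only count recent failures
  if failures.size > MAX_FAILURES
    logger.error("Too many worker failures, giving up", :exception => e)
    raise                                      # crash so the operator notices
  end
  logger.warn("Worker failed, retrying", :exception => e)
  retry
end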

@lexelby
Author

lexelby commented Feb 10, 2015

Could we please remove the "needs more info" tag on this issue? I've explained why I don't think it needs more info, and I've described the bug in a great deal of detail. I could submit a pull request if you feel that would be necessary, but I'd be much more interested in hearing the devs' thoughts on the best solution here.

@jsvd
Member

jsvd commented Feb 10, 2015

@lexelby removed. I wrote more about this issue here and also created a proposal here

@lexelby
Author

lexelby commented Feb 10, 2015

Thanks!!

@blysik

blysik commented Feb 16, 2015

This issue is currently killing us also.

@suyograo suyograo modified the milestones: 1.5.1, v1.5.0 Feb 26, 2015
@jsvd jsvd removed the v1.5.0 label Feb 26, 2015
@jsvd jsvd modified the milestones: v1.6.0, v1.5.1 Feb 26, 2015
@ScottChapman

Is there any update on this? Is it planned to be addressed in 1.5?

@suyograo suyograo added v1.6.0 and removed confirmed labels May 15, 2015
@suyograo suyograo removed this from the v1.5.1 milestone May 15, 2015
@elvarb

elvarb commented May 23, 2015

I had this problem as well. My solution was just to run everything through the i18n filter first. It removes all special characters, so the filters that follow have no issues with bogus data.

@inqueue
Member

inqueue commented Oct 20, 2015

A similar issue can occur, for example, when failing to configure the cluster name in the elasticsearch output plugin.

bin/logstash -f /etc/logstash/conf.d/1-zimbra-postfix.conf
Oct 19, 2015 7:57:22 PM org.elasticsearch.node.internal.InternalNode <init>
INFO: [logstash-zmb1.zmbdom.tld-28028-13458] version[1.7.0], pid[28028], build[929b973/2015-07-16T14:31:07Z]
Oct 19, 2015 7:57:22 PM org.elasticsearch.node.internal.InternalNode <init>
INFO: [logstash-zmb1.zmbdom.tld-28028-13458] initializing ...
Oct 19, 2015 7:57:22 PM org.elasticsearch.plugins.PluginsService <init>
INFO: [logstash-zmb1.zmbdom.tld-28028-13458] loaded [], sites []
Oct 19, 2015 7:57:23 PM org.elasticsearch.bootstrap.Natives <clinit>
WARNING: JNA not found. native methods will be disabled.
Oct 19, 2015 7:57:24 PM org.elasticsearch.node.internal.InternalNode <init>
INFO: [logstash-zmb1.zmbdom.tld-28028-13458] initialized
Oct 19, 2015 7:57:24 PM org.elasticsearch.node.internal.InternalNode start
INFO: [logstash-zmb1.zmbdom.tld-28028-13458] starting ...
Oct 19, 2015 7:57:24 PM org.elasticsearch.transport.TransportService doStart
INFO: [logstash-zmb1.zmbdom.tld-28028-13458] bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address {inet[/192.168.5.16:9301]}
Oct 19, 2015 7:57:24 PM org.elasticsearch.discovery.DiscoveryService doStart
INFO: [logstash-zmb1.zmbdom.tld-28028-13458] elasticsearch/ETb8ivnuSeGgOrceTw2O8Q
Oct 19, 2015 7:57:54 PM org.elasticsearch.discovery.DiscoveryService waitForInitialState
WARNING: [logstash-zmb1.zmbdom.tld-28028-13458] waited for 30s and no initial state was set by the discovery
Oct 19, 2015 7:57:54 PM org.elasticsearch.node.internal.InternalNode start
INFO: [logstash-zmb1.zmbdom.tld-28028-13458] started
Failed to install template: waited for [30s] {:level=>:error}
Logstash startup completed
...events truncated...
Got error to send bulk of actions: blocked by: [SERVICE_UNAVAILABLE/1/state not recovered / initialized];[SERVICE_UNAVAILABLE/2/no master]; {:level=>:error}
Failed to flush outgoing items {:outgoing_count=>10, :exception=>"Java::OrgElasticsearchClusterBlock::ClusterBlockException", :backtrace=>["org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedException(org/elasticsearch/cluster/block/ClusterBlocks.java:151)", "org.elasticsearch.cluster.block.ClusterBlocks.globalBlockedRaiseException(org/elasticsearch/cluster/block/ClusterBlocks.java:141)", "org.elasticsearch.action.bulk.TransportBulkAction.executeBulk(org/elasticsearch/action/bulk/TransportBulkAction.java:215)", "org.elasticsearch.action.bulk.TransportBulkAction.access$000(org/elasticsearch/action/bulk/TransportBulkAction.java:67)", "org.elasticsearch.action.bulk.TransportBulkAction$1.onFailure(org/elasticsearch/action/bulk/TransportBulkAction.java:153)", "org.elasticsearch.action.support.TransportAction$ThreadedActionListener$2.run(org/elasticsearch/action/support/TransportAction.java:137)", "java.util.concurrent.ThreadPoolExecutor.runWorker(java/util/concurrent/ThreadPoolExecutor.java:1142)", "java.util.concurrent.ThreadPoolExecutor$Worker.run(java/util/concurrent/ThreadPoolExecutor.java:617)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}
...repeats

It took me a while to figure out what was going on. It'd be nice if the ES output plugin could at least log an event when it is unable to connect to the configured cluster name. In my case, the plugin was just trying the default cluster name, elasticsearch.

@hokie1999

Hello, was there ever a fix for this exception? We recently updated to 2.3.4 (July 7, 2016) from 1:2.3.3-1, which was experiencing exception issues. Could you provide an update? Thanks.

@untergeek
Member

Going to close this as it seems to be:

  1. very outdated
  2. largely a per-plugin issue, rather than a core-specific one
  3. less of an issue now that the persistent queue (PQ) exists, along with many recent fixes

If you feel this is in error, please open a new ticket in the repository for the specific plugin.

@lexelby
Author

lexelby commented Jan 8, 2017

@untergeek: I find this very frustrating.

I previously explained that this is not a per-plugin issue, but instead a design flaw in the way that outputs are run. See my comment above for more detail.

I did a quick code dive and it appears that the problem still exists, so this issue is not outdated. Check out output_batch() in pipeline.rb. The link is to the (current) trunk version.

output_func(event) gets called without being wrapped in a begin...rescue block. If any output allows an exception to reach the top level, that worker thread dies. I don't see any code to restart worker threads.

Persistent queues don't help. I mean, sure, they make it so that you don't actually lose in-memory messages if all of your worker threads die. But you still end up with logstash stalled, just silently sitting there not processing any messages. PQs have the same back-pressure mechanism (https://github.com/elastic/logstash/blob/2c1736531561dd64d51e923b635618c8b7ee904f/docs/static/persistent-queues.asciidoc#handling-back-pressure) that stalls the pipeline if no worker threads are running.

The solution is still simple: catch exceptions from outputs, log them, and allow the worker thread to continue on. An alternative would be to detect that all worker threads have died and cause Logstash to exit, so that monitoring can catch the issue or a supervisor process can restart it. The latter is less optimal because it means Logstash's event throughput slowly degrades over time as it has fewer and fewer living worker threads.
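Concretely, the first option amounts to something like this around the per-event output call (a sketch only; output_func and the logger call are stand-ins for whatever the current pipeline code uses):

events.each do |event|
  begin
    output_func(event)
  rescue => e
    # log and drop the offending event; the worker thread stays alive
    # and keeps draining the queue instead of silently dying
    @logger.error("Output raised an exception, dropping event",
                  :exception => e, :backtrace => e.backtrace)
  end
end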

In summary: I recommend you reopen this. Since I no longer run logstash, I don't have a dog in this fight anymore other than a) my previously invested time, and b) trying to make a great open source project greater. So I'm going to drop this unless asked directly for clarification.

@andrewvc
Contributor

andrewvc commented Jan 9, 2017

@lexelby I totally agree with you on this, but the good news is we don't need to reopen this, since the issue was fixed a long time ago. Logstash will now fully crash when an output raises an unexpected error.

This is the correct behavior IMHO, since outputs should never raise exceptions to the pipeline. Raising one is treated as a bug, and there's only one 'right' error handling strategy in that case: the only safe thing to do is crash.

@lexelby
Author

lexelby commented Jan 9, 2017

AWESOME! I agree, that's a great strategy. It surfaces the problem to the operator so that they can report the bug upstream. Thanks for clarifying!

@andrewvc
Contributor

andrewvc commented Jan 9, 2017

Thanks for following up on this @lexelby :)
