Shutdown Semantics: Exception handling in the pipeline #2477

jsvd · 2015-01-29T17:40:30Z

Related issues/prs:
#1373
#1250
#2130
#2152

The pipeline is composed of {input,filter,output}workers that execute the plugins' code and take care of uncaught exceptions that may occur during their execution.

These workers should tolerate (hopefully) transient exceptions and not allow the pipeline to crash. Also, known fatal exceptions should abort the pipeline execution and terminate logstash. Therefore, workers should have the following behaviour:

Input workers have a long running method call `run`

begin
  # call input_plugin.run (usually a loop)
rescue TransientException
  @logger.warn event, exception
  retry
rescue FatalException
  @logger.error event, exception
rescue ShutdownSignal
  @logger.info event
ensure       # in both situations above we just want to log, call teardown, and exit
  teardown
end

Filters and output workers continuously pop from a queue

begin
  loop 
    event = queue.pop
    if LogStash::Event
      # call plugin code
    elsif LogStash::ShutdownEvent
      # if filter pass event along
      break
    end
  end
rescue TransientException
  @logger.warn event, exception
  retry
rescue FatalException
  @logger.error event, exception
ensure
  @filters.each teardown
  # or
  @outputs.each teardown
end

What is a TransientException / FatalException?

We have no way of knowing what the plugin considers a transient/fatal exception. Two possible choices:

Treat all StandardError derived exceptions as transient, those derived from Exception as fatal
Provide LogStash::TransientException as a parent exception for transient exceptions, LogStash::FatalException for fatal ones. Plugins would only throw descendants of these 2 classes

Questions

Should a transient failure call some setup/teardown behavior on the plugin?
Current code for the input plugin calls teardown and then retries, but at that point the plugin instance might no longer be usable.

For filters and outputs, the shutdown is done through a ShutdownEvent. However, if a plugin is very slow (e.g. executing sleep 100000), it will never pop the ShutdownEvent from the queue and execute teardown. Should there be a pipeline.shutdown(force=true) on some situations?

Thread naming problem

Related issues:
#2462
#2425

Best practices when developing plugins

Exception handling recommendations:

which kind of exceptions should reach the worker?
what happens if a plugin misbehaves (throws top level Exception, hangs)?

The text was updated successfully, but these errors were encountered:

dadoonet · 2015-04-01T13:18:37Z

Not sure it's related. When I have an exception in a filter, I can't stop anymore Logstash process from running when using Linux pipe.

For example:

Consider the following logstash.conf file:

input {
  stdin { }
}

filter {
  grok {
    match => {
      "message" => '^%{DATA:start}\[%{HTTPDATE:timestamp}\]%{DATA:end}$'
    }
  }

  date {
    match => [ "timestamp", "dd/MMM/YYYY:HH:mm:ss Z" ]
    locale => en
  }

  # Bump the time forward
  ruby {
    init => "last = Time.iso8601('2014-09-25T12:00:00+00:00'); @shift = Time.iso8601('2015-04-10T15:00:00.000+02:00') - last"
    code => "event['@timestamp'] += @shift"
  }

  mutate {
    replace => { "message" => "%{start}[%{+dd/MMM/YYYY:HH:mm:ss Z}]%{end}" }
  }
}

output {
  stdout { codec => line { format => "%{message}" } }
}

the error come from the fact in LS 1.5 we don't use Time but LogStash::Timestamp class.

When running:

bin/logstash -f logstash.conf

I got the error and I can stop the process using CTRL+D.

But when running:

cat logs | bin/logstash -f logstash.conf

I can not stop the Logstash process and I need to kill -9 it.
Wanna another issue for it or is this comment enough?

joekiller · 2015-04-02T16:07:07Z

I too am experiencing this. #2386 didn't help so I'm trying #1373 to see what that does.

colinsurprenant · 2015-04-02T16:12:15Z

fyi, #2869 fixed a critical shutdown bug. also, #2928 will fix additional shutdown conditions.

@dadoonet note that shutdown hanging when using the stdin input is a log standing issue for which we do not have a solution yet, see #1769

joekiller · 2015-04-02T16:26:29Z

okay I'll try out the latest again

joekiller · 2015-04-05T16:22:13Z

The aforementioned fixes make 1.5 appear to be stable and working. Still
testing but I'm not hitting the pipeline freezes like before.
On Apr 2, 2015 12:12 PM, "Colin Surprenant" notifications@github.com
wrote:

fyi, #2869 #2869 fixed a
critical shutdown bug. also, #2928
#2928 will also fix additional
shutdown conditions.
also @dadoonet https://github.com/dadoonet note that shutdown hanging
when using the stdin input is a log standing issue for which we do not
have a solution yet, see #1769
#1769

—
Reply to this email directly or view it on GitHub
#2477 (comment).

colinsurprenant · 2015-04-05T19:15:29Z

@joekiller exactly which "aforementioned fixes"?

joekiller · 2015-04-05T19:58:43Z

#2928 I'm thinking however I did take the build from master so it could
have been something else but I had problems shutting down logstash when the
problem occurred and it would appear that a thread was stuck.
On Apr 5, 2015 3:15 PM, "Colin Surprenant" notifications@github.com wrote:

@joekiller exactly which "aforementioned fixes"?

—
Reply to this email directly or view it on GitHub.

guyboertje · 2015-08-27T13:58:10Z

My 2 cents. Many plugins do not handle the range of exceptions with enough forethought.

For example, imagine a SMTP output plugin. There are many ways that a SMTP call can fail:
Recipient errors (hard: receipient unknown, soft: mailbox full)
SMTP errors (hard: malformation, soft: server busy)
Auth errors (hard)
Network errors
Ruby errors

We need a mechanism in the plugin to handle these sorts of errors appropriately. I suggest a default wrapper that provides the minimum standard and encourage plugin developers to provide a specific subclass for that plugin if the default one is too generic.
e.g.

def receive(event)
  DefaultErrorHandler.new(self).exec do
    # some receive code
  end
end

and in LogStash::Plugin

class DefaultErrorHandler < SimpleDelegator
  # initialize provided by SimpleDelegator
  def exec
    begin
      yield if block_given?
    rescue StandardError => e
      logger.error(default_error_log_message, :error => e.inspect)
    end
  end

  private

  def default_error_log_message
    "Unexpected error occurred in #{type} plugin: #{name}"
  end
end

and in the fictitious SMTP output

SmtpTransientError = Class.new(TransientException)

class SmtpOutputErrorHandler < SimpleDelegator
  MaxRetries = 10
  attr_reader :retry_count
  # initialize provided by SimpleDelegator
  def exec
    @retry_count = 0
    begin
      yield if block_given?
    rescue Net::SMTPAuthenticationError,
           Net::SMTPSyntaxError,
           Net::SMTPFatalError,
           Net::SMTPUnknownError => e
        raise SmtpTransientError.build_from(e)
    rescue Net::SMTPServerBusy,
           Net::OpenTimeout,
           IOError
        if retry_count < MaxRetries
          retry_count = retry_count.next
          sleep 0.1
          retry
        else
          logger.error(retry_error_log_message(e), :error => e.inspect)
        end
    rescue StandardError => e
      logger.error(default_error_log_message, :error => e.inspect)
    end
  end

  private

  def retry_error_log_message(error)
    "#{error.class} error occurred in output plugin: SMTP, retries exhausted"
  end
end

Another example, a Mongo plugin, it should provide rescue causes for Mongo errors and non Mongo errors, as a Mongo error has an error_code method and other errors do not.

jsvd added design discuss labels Jan 29, 2015

jsvd self-assigned this Jan 29, 2015

This was referenced Jan 30, 2015

SQS input plugin won't restart #1776

Closed

Normalize the exception behaviour for inputs outputs and filters #2386

Closed

jsvd mentioned this issue Feb 10, 2015

exception in output thread silently locks up logstash #2130

Closed

ph mentioned this issue Feb 11, 2015

http module fails and breaks logstash #2560

Closed

jsvd modified the milestones: v1.6.0, v1.5.1 Feb 26, 2015

suyograo added the v1.6.0 label May 15, 2015

suyograo removed this from the v1.5.1 milestone May 15, 2015

colinsurprenant mentioned this issue May 21, 2015

When something is not jsonable, the service stop #3289

Closed

colinsurprenant mentioned this issue Jun 16, 2015

stalled outputs will prevent proper shutdown #3451

Closed

7 tasks

purbon mentioned this issue Aug 26, 2015

Exception handling in plugins #3794

Closed

suyograo added v2.0.0 and removed v1.6.0 labels Sep 1, 2015

suyograo changed the title ~~Exception handling in the pipeline~~ Shutdown Semantics: Exception handling in the pipeline Sep 1, 2015

suyograo removed the v2.0.0 label Sep 25, 2015

This was referenced Oct 12, 2015

Make yaml loading error more transparent to the end user logstash-plugins/logstash-filter-translate#13

Closed

undefined method `compact' when tags field is an array logstash-plugins/logstash-output-riemann#12

Open

suyograo added the v2.1.0 label Oct 20, 2015

suyograo added v2.2.0 and removed v2.1.0 labels Nov 16, 2015

suyograo added v2.3.0 and removed v2.2.0 labels Dec 16, 2015

spujadas mentioned this issue Jan 12, 2016

filebeat broken pipe spujadas/elk-docker#14

Closed

suyograo added code cleanup shutdown semantics and removed v2.3.0 labels Feb 28, 2016

suyograo removed the shutdown semantics label Sep 14, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Shutdown Semantics: Exception handling in the pipeline #2477

Shutdown Semantics: Exception handling in the pipeline #2477

jsvd commented Jan 29, 2015

dadoonet commented Apr 1, 2015

joekiller commented Apr 2, 2015

colinsurprenant commented Apr 2, 2015

joekiller commented Apr 2, 2015

joekiller commented Apr 5, 2015

colinsurprenant commented Apr 5, 2015

joekiller commented Apr 5, 2015

guyboertje commented Aug 27, 2015

Shutdown Semantics: Exception handling in the pipeline #2477

Shutdown Semantics: Exception handling in the pipeline #2477

Comments

jsvd commented Jan 29, 2015

Input workers have a long running method call run

Filters and output workers continuously pop from a queue

What is a TransientException / FatalException?

Questions

Thread naming problem

Best practices when developing plugins

dadoonet commented Apr 1, 2015

joekiller commented Apr 2, 2015

colinsurprenant commented Apr 2, 2015

joekiller commented Apr 2, 2015

joekiller commented Apr 5, 2015

colinsurprenant commented Apr 5, 2015

joekiller commented Apr 5, 2015

guyboertje commented Aug 27, 2015

Input workers have a long running method call `run`