Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shutdown Semantics: Exception handling in the pipeline #2477

Open
jsvd opened this issue Jan 29, 2015 · 8 comments
Open

Shutdown Semantics: Exception handling in the pipeline #2477

jsvd opened this issue Jan 29, 2015 · 8 comments

Comments

@jsvd
Copy link
Member

jsvd commented Jan 29, 2015

Related issues/prs:
#1373
#1250
#2130
#2152

The pipeline is composed of {input,filter,output}workers that execute the plugins' code and take care of uncaught exceptions that may occur during their execution.

These workers should tolerate (hopefully) transient exceptions and not allow the pipeline to crash. Also, known fatal exceptions should abort the pipeline execution and terminate logstash. Therefore, workers should have the following behaviour:

Input workers have a long running method call run
begin
  # call input_plugin.run (usually a loop)
rescue TransientException
  @logger.warn event, exception
  retry
rescue FatalException
  @logger.error event, exception
rescue ShutdownSignal
  @logger.info event
ensure       # in both situations above we just want to log, call teardown, and exit
  teardown
end
Filters and output workers continuously pop from a queue
begin
  loop 
    event = queue.pop
    if LogStash::Event
      # call plugin code
    elsif LogStash::ShutdownEvent
      # if filter pass event along
      break
    end
  end
rescue TransientException
  @logger.warn event, exception
  retry
rescue FatalException
  @logger.error event, exception
ensure
  @filters.each teardown
  # or
  @outputs.each teardown
end

What is a TransientException / FatalException?

We have no way of knowing what the plugin considers a transient/fatal exception. Two possible choices:

  1. Treat all StandardError derived exceptions as transient, those derived from Exception as fatal
  2. Provide LogStash::TransientException as a parent exception for transient exceptions, LogStash::FatalException for fatal ones. Plugins would only throw descendants of these 2 classes

Questions

Should a transient failure call some setup/teardown behavior on the plugin?
Current code for the input plugin calls teardown and then retries, but at that point the plugin instance might no longer be usable.

For filters and outputs, the shutdown is done through a ShutdownEvent. However, if a plugin is very slow (e.g. executing sleep 100000), it will never pop the ShutdownEvent from the queue and execute teardown. Should there be a pipeline.shutdown(force=true) on some situations?

Thread naming problem

Related issues:
#2462
#2425

Best practices when developing plugins

Exception handling recommendations:

  • which kind of exceptions should reach the worker?
  • what happens if a plugin misbehaves (throws top level Exception, hangs)?
@dadoonet
Copy link
Member

dadoonet commented Apr 1, 2015

Not sure it's related. When I have an exception in a filter, I can't stop anymore Logstash process from running when using Linux pipe.

For example:

Consider the following logstash.conf file:

input {
  stdin { }
}

filter {
  grok {
    match => {
      "message" => '^%{DATA:start}\[%{HTTPDATE:timestamp}\]%{DATA:end}$'
    }
  }

  date {
    match => [ "timestamp", "dd/MMM/YYYY:HH:mm:ss Z" ]
    locale => en
  }

  # Bump the time forward
  ruby {
    init => "last = Time.iso8601('2014-09-25T12:00:00+00:00'); @shift = Time.iso8601('2015-04-10T15:00:00.000+02:00') - last"
    code => "event['@timestamp'] += @shift"
  }

  mutate {
    replace => { "message" => "%{start}[%{+dd/MMM/YYYY:HH:mm:ss Z}]%{end}" }
  }
}

output {
  stdout { codec => line { format => "%{message}" } }
}

the error come from the fact in LS 1.5 we don't use Time but LogStash::Timestamp class.

When running:

bin/logstash -f logstash.conf

I got the error and I can stop the process using CTRL+D.

But when running:

cat logs | bin/logstash -f logstash.conf

I can not stop the Logstash process and I need to kill -9 it.
Wanna another issue for it or is this comment enough?

@joekiller
Copy link
Contributor

I too am experiencing this. #2386 didn't help so I'm trying #1373 to see what that does.

@colinsurprenant
Copy link
Contributor

fyi, #2869 fixed a critical shutdown bug. also, #2928 will fix additional shutdown conditions.

@dadoonet note that shutdown hanging when using the stdin input is a log standing issue for which we do not have a solution yet, see #1769

@joekiller
Copy link
Contributor

okay I'll try out the latest again

@joekiller
Copy link
Contributor

The aforementioned fixes make 1.5 appear to be stable and working. Still
testing but I'm not hitting the pipeline freezes like before.
On Apr 2, 2015 12:12 PM, "Colin Surprenant" notifications@github.com
wrote:

fyi, #2869 #2869 fixed a
critical shutdown bug. also, #2928
#2928 will also fix additional
shutdown conditions.
also @dadoonet https://github.com/dadoonet note that shutdown hanging
when using the stdin input is a log standing issue for which we do not
have a solution yet, see #1769
#1769


Reply to this email directly or view it on GitHub
#2477 (comment).

@colinsurprenant
Copy link
Contributor

@joekiller exactly which "aforementioned fixes"?

@joekiller
Copy link
Contributor

#2928 I'm thinking however I did take the build from master so it could
have been something else but I had problems shutting down logstash when the
problem occurred and it would appear that a thread was stuck.
On Apr 5, 2015 3:15 PM, "Colin Surprenant" notifications@github.com wrote:

@joekiller exactly which "aforementioned fixes"?


Reply to this email directly or view it on GitHub.

@guyboertje
Copy link
Contributor

My 2 cents. Many plugins do not handle the range of exceptions with enough forethought.

For example, imagine a SMTP output plugin. There are many ways that a SMTP call can fail:
Recipient errors (hard: receipient unknown, soft: mailbox full)
SMTP errors (hard: malformation, soft: server busy)
Auth errors (hard)
Network errors
Ruby errors

We need a mechanism in the plugin to handle these sorts of errors appropriately. I suggest a default wrapper that provides the minimum standard and encourage plugin developers to provide a specific subclass for that plugin if the default one is too generic.
e.g.

def receive(event)
  DefaultErrorHandler.new(self).exec do
    # some receive code
  end
end

and in LogStash::Plugin

class DefaultErrorHandler < SimpleDelegator
  # initialize provided by SimpleDelegator
  def exec
    begin
      yield if block_given?
    rescue StandardError => e
      logger.error(default_error_log_message, :error => e.inspect)
    end
  end

  private

  def default_error_log_message
    "Unexpected error occurred in #{type} plugin: #{name}"
  end
end

and in the fictitious SMTP output

SmtpTransientError = Class.new(TransientException)

class SmtpOutputErrorHandler < SimpleDelegator
  MaxRetries = 10
  attr_reader :retry_count
  # initialize provided by SimpleDelegator
  def exec
    @retry_count = 0
    begin
      yield if block_given?
    rescue Net::SMTPAuthenticationError,
           Net::SMTPSyntaxError,
           Net::SMTPFatalError,
           Net::SMTPUnknownError => e
        raise SmtpTransientError.build_from(e)
    rescue Net::SMTPServerBusy,
           Net::OpenTimeout,
           IOError
        if retry_count < MaxRetries
          retry_count = retry_count.next
          sleep 0.1
          retry
        else
          logger.error(retry_error_log_message(e), :error => e.inspect)
        end
    rescue StandardError => e
      logger.error(default_error_log_message, :error => e.inspect)
    end
  end

  private

  def retry_error_log_message(error)
    "#{error.class} error occurred in output plugin: SMTP, retries exhausted"
  end
end

Another example, a Mongo plugin, it should provide rescue causes for Mongo errors and non Mongo errors, as a Mongo error has an error_code method and other errors do not.

@suyograo suyograo added v2.0.0 and removed v1.6.0 labels Sep 1, 2015
@suyograo suyograo changed the title Exception handling in the pipeline Shutdown Semantics: Exception handling in the pipeline Sep 1, 2015
@suyograo suyograo removed the v2.0.0 label Sep 25, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants