Kibana crashes with `ECONNRESET` when connection to remote cluster is lost due to unhandled promises #110433

sorenlouv · 2021-08-30T07:16:05Z

The problem
Kibana crashes with an ECONNRESET error when the connection to the Elasticsearch cluster is lost (I assume this is the cause).

Expectation
Kibana should not crash when the connection to the Elasticsearch cluster is lost.

Reproduction

This happens on my local developer machine when running yarn start and connecting to an external (cloud-based) Elasticsearch cluster. If I close the laptop lid, which disables the network, Kibana crashes after about 1 hour.

First I see the following lines:

[warning][kibanaUsageCollection][plugins] Average event loop delay threshold exceeded 350ms. Received 2654.3637022916055ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
[error][plugins][taskManager] [Task Poller Monitor]: Observable Monitor: Hung Observable restarted after 33000ms of inactivity
[warning][plugins][taskManager] Detected potential performance issue with Task Manager. Set 'xpack.task_manager.monitored_stats_health_verbose_log.enabled: true' in your Kibana.yml to enable debug logging
[info][status] Kibana is now unavailable (was available)
[info][status] Kibana is now available (was unavailable)

Then after a few more minutes I see:

[warning][kibanaUsageCollection][plugins] Average event loop delay threshold exceeded 350ms. Received 577.03415997404ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
[error][plugins][taskManager] [Task Poller Monitor]: Observable Monitor: Hung Observable restarted after 33000ms of inactivity
[warning][plugins][taskManager] Detected potential performance issue with Task Manager. Set 'xpack.task_manager.monitored_stats_health_verbose_log.enabled: true' in your Kibana.yml to enable debug logging

And then it crashes with:

Unhandled Promise rejection detected:

ConnectionError: read ECONNRESET
    at ClientRequest.onError (/Users/sqren/elastic/kibana/node_modules/@elastic/elasticsearch/lib/Connection.js:116:16)
    at ClientRequest.emit (events.js:400:28)
    at TLSSocket.socketErrorListener (_http_client.js:475:9)
    at TLSSocket.emit (events.js:400:28)
    at emitErrorNT (internal/streams/destroy.js:106:8)
    at emitErrorCloseNT (internal/streams/destroy.js:74:3)
    at processTicksAndRejections (internal/process/task_queues.js:82:21) {
  meta: {
    body: null,
    statusCode: null,
    headers: null,
    meta: {
      context: null,
      request: [Object],
      name: 'elasticsearch-js',
      connection: [Object],
      attempts: 0,
      aborted: false
    }
  },
  isBoom: true,
  isServer: true,
  data: null,
  output: {
    statusCode: 503,
    payload: {
      statusCode: 503,
      error: 'Service Unavailable',
      message: 'read ECONNRESET'
    },
    headers: {}
  },
  [Symbol(SavedObjectsClientErrorCode)]: 'SavedObjectsClient/esUnavailable'
}

Terminating process...
 server crashed  with status code 1


elasticmachine · 2021-08-30T07:29:35Z

Pinging @elastic/kibana-core (Team:Core)

sorenlouv · 2021-08-30T07:30:19Z

Somewhat related to #101277 (fyi @mshustov)

mshustov · 2021-09-01T12:56:49Z

It looks like a plugin performing a request to the Elasticsearch server doesn't handle an exception.
It's hard to debug the problem without a reproduction scenario or an error stack. I'm not even sure we will get the error stack once the ES client drops the callback-style API (see elastic/elasticsearch-js#1542).
Have you seen any other log records with read ECONNRESET before the exception crashes the server?

afharo · 2021-09-01T14:10:43Z

> It looks like a plugin performing a request to the Elasticsearch server doesn't handle an exception.

I tried to hunt it and, unfortunately, I found too many places:

Timelion's start method calls a promise that is not handled: https://github.com/elastic/kibana/blob/master/src/plugins/timelion/server/plugin.ts#L71
APM uses async/await in the .then after core.getStartServices(): https://github.com/elastic/kibana/blob/master/x-pack/plugins/apm/server/plugin.ts#L215-L244
Same type of usage in another place in APM: https://github.com/elastic/kibana/blob/master/x-pack/plugins/apm/server/lib/apm_telemetry/index.ts#L132-L167 (mind that taskManagerStart.ensureScheduled also returns a promise).
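To illustrate the pattern described in the list above, here is a minimal sketch of a floating promise next to a handled one. The function `ensureScheduled` below is a hypothetical stand-in for any call that talks to Elasticsearch, and the `Logger` shape is assumed; neither is copied from the linked Kibana code.

```typescript
// Hypothetical stand-in for a call that talks to Elasticsearch and may reject.
async function ensureScheduled(shouldFail: boolean): Promise<string> {
  if (shouldFail) {
    throw new Error('read ECONNRESET');
  }
  return 'scheduled';
}

// Minimal logger shape, assumed for this sketch only.
interface Logger {
  warn(message: string): void;
}

// Floating promise: nothing observes the rejection, so if the call fails,
// Node.js reports an unhandled rejection. Defined for comparison, never called here.
function startUnsafe(): void {
  ensureScheduled(true);
}

// Handled variant: the rejection is caught and logged instead of escaping.
async function startSafe(logger: Logger, shouldFail: boolean): Promise<void> {
  try {
    await ensureScheduled(shouldFail);
  } catch (err) {
    const reason = err instanceof Error ? err.message : String(err);
    logger.warn(`Failed to schedule task: ${reason}`);
  }
}
```

Whether logging is the right reaction depends on the plugin; the point is only that the rejection has an owner.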

I found a useful way to identify the calls that make it fail: I created an ES mock that always returns 505, and echoes the received body for POST and PUT requests (for now). I can continue the investigation to find them all if we want to, although I'd suggest sending an email to Kibana Contributors, asking each plugin owner to review their implementations to ensure that this is not happening.
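A mock along those lines could look roughly like the sketch below. Only the 505 status and the echoing of POST and PUT bodies come from the description above; the function names and everything else are assumptions, not the actual mock that was used.

```typescript
import * as http from 'http';

// Status code taken from the description above.
const MOCK_STATUS = 505;

interface MockResponse {
  statusCode: number;
  body: string;
}

// Pure helper: decides what the mock sends back for a given request.
function buildMockResponse(method: string | undefined, requestBody: string): MockResponse {
  const echoBody = method === 'POST' || method === 'PUT';
  return { statusCode: MOCK_STATUS, body: echoBody ? requestBody : '' };
}

// Hypothetical server factory: every request fails with MOCK_STATUS.
function createMockEs(): http.Server {
  return http.createServer((req, res) => {
    const chunks: Buffer[] = [];
    req.on('data', (chunk: Buffer) => chunks.push(chunk));
    req.on('end', () => {
      const received = Buffer.concat(chunks).toString('utf8');
      const { statusCode, body } = buildMockResponse(req.method, received);
      res.writeHead(statusCode, { 'Content-Type': 'application/json' });
      res.end(body);
    });
  });
}
```

Calling `createMockEs().listen(port)` on a free port and pointing `elasticsearch.hosts` at it should make every ES request fail, which surfaces the call sites that do not handle the rejection.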

However, we may need to document a common way to proceed when plugins need to create something, e.g., what's the best retry logic (maybe a core API to do so?).
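As one possible shape for such retry logic, here is a minimal sketch of a retry helper with capped exponential backoff. The names `retryWithBackoff`, `backoffDelay`, and `RetryOptions` are hypothetical; this is not an existing core API.

```typescript
// Hypothetical options for this sketch.
interface RetryOptions {
  retries: number;        // how many times to retry after the first attempt
  initialDelayMs: number; // delay before the first retry
  maxDelayMs: number;     // upper bound for any single delay
}

// Exponential backoff: initialDelayMs * 2^attempt, capped at maxDelayMs.
function backoffDelay(attempt: number, opts: RetryOptions): number {
  return Math.min(opts.initialDelayMs * Math.pow(2, attempt), opts.maxDelayMs);
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs `operation` until it succeeds or the retries are exhausted.
// The last error is rethrown, so the caller still has to handle it.
async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  opts: RetryOptions,
  sleep: (ms: number) => Promise<void> = defaultSleep
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= opts.retries; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < opts.retries) {
        await sleep(backoffDelay(attempt, opts));
      }
    }
  }
  throw lastError;
}
```

The helper only delays the failure; a caller that does not await or catch its result would still leave an unhandled rejection.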

mshustov · 2021-09-01T14:52:38Z

> although I'd suggest sending an email to Kibana Contributors, asking to each plugin owner to review their implementations to ensure that this is not happening.

Yeah, I'm not sure we are able to find all such cases.
An email is a good start to draw attention to the problem. In the long run, we should provide recommendations on error handling (#99568). IIRC, @kobelb said he could do it later when he had some free time.

mshustov · 2021-09-17T09:20:59Z

@spalger found the same problem and fixed a few places in #111637.
We discussed the problem on the last DX call, and @spalger is going to try https://github.com/typescript-eslint/typescript-eslint/blob/master/packages/eslint-plugin/docs/rules/no-floating-promises.md for the Core code. The main concern is that the rule needs type information, which might affect ESLint performance (see https://github.com/typescript-eslint/typescript-eslint/blob/master/docs/getting-started/linting/TYPED_LINTING.md#performance).
If the trial is successful, we could start rolling it out to the other plugins.

elasticmachine · 2021-09-20T08:29:51Z

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

rudolf · 2021-09-20T08:31:42Z

@elastic/kibana-alerting-services Based on others' observations, this seems to happen after:

Observable Monitor: Hung Observable restarted after 33000ms of inactivity

It might just be a coincidence, though: when ES is unavailable, task manager is the first to start failing.

mikecote · 2021-09-21T23:45:36Z

> @elastic/kibana-alerting-services Based on others' observations, this seems to happen after:
>
> Observable Monitor: Hung Observable restarted after 33000ms of inactivity
>
> It might just be a coincidence, though: when ES is unavailable, task manager is the first to start failing.

I think it's a coincidence, because the task manager message is logged whenever it detects that it hasn't polled for a certain amount of time (33000ms in this case). Under the hood, I'm guessing the event loop resumes the function in task manager on wake, sees that it's been some time since it last polled, and logs the message. I believe I've also seen this before when putting my laptop to sleep for a while.

@gmmorris can you confirm the above statement?

gmmorris · 2021-09-22T11:42:50Z

Indeed.
We have an observable that emits an event every time it's supposed to query ES.
We have another observable that emits an event after 30s; if the first observable has hung for whatever reason (and has failed to emit an event in the past 30s), it restarts that observable.

It sounds like a symptom rather than the cause.
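The mechanism described above can be sketched without RxJS as a small watchdog with an injected clock. The class name `PollerWatchdog` and its methods are hypothetical; this is not how Task Manager implements it, only an illustration of why a long pause (such as a laptop sleeping) triggers the restart message.

```typescript
// Hypothetical watchdog: restarts a poller that has not emitted for `timeoutMs`.
class PollerWatchdog {
  private lastEmitAt: number;

  constructor(
    private readonly timeoutMs: number,
    private readonly restart: () => void,
    now: number
  ) {
    this.lastEmitAt = now;
  }

  // Called every time the poller emits (i.e. is about to query ES).
  recordEmit(now: number): void {
    this.lastEmitAt = now;
  }

  // Called periodically by the monitor. Returns true if a restart was triggered.
  check(now: number): boolean {
    const idleMs = now - this.lastEmitAt;
    if (idleMs < this.timeoutMs) {
      return false;
    }
    this.restart();
    this.lastEmitAt = now; // give the restarted poller a fresh window
    return true;
  }
}
```

If the process is suspended, the first `check` after wake sees a large gap between `now` and the last emit and restarts the poller, regardless of whether ES was reachable.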

afharo · 2024-04-26T13:39:50Z

#181456 might finally address this

sorenlouv added the bug Fixes for quality problems that affect the customer experience label Aug 30, 2021

botelastic bot added the needs-team Issues missing a team label label Aug 30, 2021

sorenlouv changed the title ~~ECONNRESET crashed Kibana when network connectivity is lost~~ Kibana crashes with ECONNRESET when connection to remote cluster is lost Aug 30, 2021

sorenlouv added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Aug 30, 2021

botelastic bot removed the needs-team Issues missing a team label label Aug 30, 2021

rudolf added the Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) label Sep 20, 2021

kobelb added the needs-team Issues missing a team label label Jan 31, 2022

botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022

afharo changed the title ~~Kibana crashes with ECONNRESET when connection to remote cluster is lost~~ Kibana crashes with ECONNRESET when connection to remote cluster is lost due to unhandled promises Jul 19, 2022

afharo self-assigned this Apr 26, 2024

afharo linked a pull request Apr 30, 2024 that will close this issue

Add @typescript-eslint/no-floating-promises #181456

Merged

66 tasks

afharo closed this as completed in #181456 May 1, 2024
