Kibana crashes with ECONNRESET when connection to remote cluster is lost due to unhandled promises #110433

Closed
sorenlouv opened this issue Aug 30, 2021 · 11 comments · Fixed by #181456
Assignees
Labels
bug Fixes for quality problems that affect the customer experience Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams)

Comments

@sorenlouv
Member

The problem
Kibana crashes with ECONNRESET error when connection to Elasticsearch cluster is lost (I assume this is the problem)

Expectation
Kibana should not crash when connection to Elasticsearch cluster is lost

Reproduction

This happens on my local developer machine when running yarn start and connecting to an external (cloud-based) Elasticsearch cluster. If I close the laptop lid, which disables the network, Kibana crashes after about 1 hour.

First I see the following lines:

[warning][kibanaUsageCollection][plugins] Average event loop delay threshold exceeded 350ms. Received 2654.3637022916055ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
[error][plugins][taskManager] [Task Poller Monitor]: Observable Monitor: Hung Observable restarted after 33000ms of inactivity
[warning][plugins][taskManager] Detected potential performance issue with Task Manager. Set 'xpack.task_manager.monitored_stats_health_verbose_log.enabled: true' in your Kibana.yml to enable debug logging
[info][status] Kibana is now unavailable (was available)
[info][status] Kibana is now available (was unavailable)

Then after a few more minutes I see:

[warning][kibanaUsageCollection][plugins] Average event loop delay threshold exceeded 350ms. Received 577.03415997404ms. See https://ela.st/kibana-scaling-considerations for more information about scaling Kibana.
[error][plugins][taskManager] [Task Poller Monitor]: Observable Monitor: Hung Observable restarted after 33000ms of inactivity
[warning][plugins][taskManager] Detected potential performance issue with Task Manager. Set 'xpack.task_manager.monitored_stats_health_verbose_log.enabled: true' in your Kibana.yml to enable debug logging

And then it crashes with:

Unhandled Promise rejection detected:

ConnectionError: read ECONNRESET
    at ClientRequest.onError (/Users/sqren/elastic/kibana/node_modules/@elastic/elasticsearch/lib/Connection.js:116:16)
    at ClientRequest.emit (events.js:400:28)
    at TLSSocket.socketErrorListener (_http_client.js:475:9)
    at TLSSocket.emit (events.js:400:28)
    at emitErrorNT (internal/streams/destroy.js:106:8)
    at emitErrorCloseNT (internal/streams/destroy.js:74:3)
    at processTicksAndRejections (internal/process/task_queues.js:82:21) {
  meta: {
    body: null,
    statusCode: null,
    headers: null,
    meta: {
      context: null,
      request: [Object],
      name: 'elasticsearch-js',
      connection: [Object],
      attempts: 0,
      aborted: false
    }
  },
  isBoom: true,
  isServer: true,
  data: null,
  output: {
    statusCode: 503,
    payload: {
      statusCode: 503,
      error: 'Service Unavailable',
      message: 'read ECONNRESET'
    },
    headers: {}
  },
  [Symbol(SavedObjectsClientErrorCode)]: 'SavedObjectsClient/esUnavailable'
}

Terminating process...
 server crashed  with status code 1
@sorenlouv sorenlouv added the bug Fixes for quality problems that affect the customer experience label Aug 30, 2021
@botelastic botelastic bot added the needs-team Issues missing a team label label Aug 30, 2021
@sorenlouv sorenlouv changed the title from "ECONNRESET crashed Kibana when network connectivity is lost" to "Kibana crashes with ECONNRESET when connection to remote cluster is lost" Aug 30, 2021
@sorenlouv sorenlouv added the Team:Core Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc label Aug 30, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-core (Team:Core)

@botelastic botelastic bot removed the needs-team Issues missing a team label label Aug 30, 2021
@sorenlouv
Member Author

sorenlouv commented Aug 30, 2021

Somewhat related to #101277 (fyi @mshustov)

@mshustov
Contributor

mshustov commented Sep 1, 2021

It looks like a plugin performing a request to the Elasticsearch server doesn't handle an exception.
It's hard to debug the problem without an error reproduction scenario or error stack. I'm not even sure we will get the error stack once the ES client drops the callback-style API (see elastic/elasticsearch-js#1542).
Have you seen any other log records with read ECONNRESET before the exception crashes the server?
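
For illustration only (this is not Kibana or plugin code): a minimal TypeScript sketch of the failure mode described above. `fetchFromEs`, `pollUnsafe`, and `pollSafe` are hypothetical names; `fetchFromEs` stands in for any Elasticsearch request that rejects when the connection drops.

```typescript
// Hypothetical stand-in for an Elasticsearch request that fails
// when the network connection is reset.
async function fetchFromEs(): Promise<string> {
  throw new Error('read ECONNRESET');
}

// Problematic pattern: the promise is neither awaited nor given a
// rejection handler, so a failure surfaces as an unhandled rejection.
function pollUnsafe(): void {
  fetchFromEs();
}

// Safer pattern: the rejection is caught and reported to a logger
// instead of escaping to the process level.
async function pollSafe(log: (msg: string) => void): Promise<void> {
  try {
    await fetchFromEs();
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    log(`ES request failed: ${message}`);
  }
}
```

Calling `pollSafe` during an outage produces a log entry and a resolved promise, whereas `pollUnsafe` would leave the rejection for the runtime to report.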

@afharo
Member

afharo commented Sep 1, 2021

It looks like a plugin performing a request to the Elasticsearch server doesn't handle an exception.

I tried to hunt it and, unfortunately, I found too many places:

I found a useful way to identify the calls making it fail: I created an ES mock that always returns 505, along with the received body for POST and PUT requests (for now). I can continue the investigation to find them all if we want to, although I'd suggest sending an email to Kibana Contributors, asking each plugin owner to review their implementations to ensure that this is not happening.

However, we may need to document a common way to proceed when plugins need to create something, e.g., what's the best retry logic (maybe a core API to do so?).
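
As a rough idea of what such shared retry logic could look like (a sketch under assumptions, not an existing core API): `withRetry` and `RetryOptions` are hypothetical names, and the exponential backoff policy is only one possible choice.

```typescript
// Hypothetical options for the sketch below; not an existing Kibana API.
interface RetryOptions {
  retries: number; // how many times to retry after the first failure
  delayMs: number; // base delay, doubled after each failed attempt
}

// Runs `operation`, retrying with exponential backoff when it rejects.
// Rethrows the last error once all attempts are exhausted, so the
// caller still has to handle the final rejection.
async function withRetry<T>(
  operation: () => Promise<T>,
  options: RetryOptions
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= options.retries; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < options.retries) {
        const waitMs = options.delayMs * Math.pow(2, attempt);
        await new Promise<void>((resolve) => setTimeout(resolve, waitMs));
      }
    }
  }
  throw lastError;
}
```

A caller would wrap the request, for example `withRetry(() => createSomething(), { retries: 3, delayMs: 500 })` where `createSomething` is a placeholder, and still attach a `catch` for the case where every attempt fails.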

@mshustov
Contributor

mshustov commented Sep 1, 2021

although I'd suggest sending an email to Kibana Contributors, asking each plugin owner to review their implementations to ensure that this is not happening.

Yeah, I'm not sure we are able to find all such cases.
An email is a good start to draw attention to the problem. In the long run, we should provide recommendations on error handling (#99568). IIRC, @kobelb said he could do it later when he had some free time.

@mshustov
Contributor

@spalger found the same problem and fixed a few places in #111637.
We discussed the problem on the last DX call, and @spalger is going to try https://github.com/typescript-eslint/typescript-eslint/blob/master/packages/eslint-plugin/docs/rules/no-floating-promises.md on the Core code. The main concern is that the rule needs type information, which might affect ESLint performance (see https://github.com/typescript-eslint/typescript-eslint/blob/master/docs/getting-started/linting/TYPED_LINTING.md#performance).
If the trial is successful, we could start rolling it to the other plugins.
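
For reference, a minimal sketch of what enabling the rule involves, written as a plain configuration object. This is not Kibana's actual ESLint setup; the `eslintConfig` name and the `./tsconfig.json` path are assumptions. The `parserOptions.project` entry is what supplies the type information mentioned above.

```typescript
// Hypothetical ESLint configuration object (eslintrc-style). In a real
// project this object would be exported from the ESLint config file.
const eslintConfig = {
  parser: '@typescript-eslint/parser',
  parserOptions: {
    // Typed rules need a TypeScript project; this path is an assumption.
    project: './tsconfig.json',
  },
  plugins: ['@typescript-eslint'],
  rules: {
    // Reports promises that are neither awaited, returned, nor given a
    // rejection handler.
    '@typescript-eslint/no-floating-promises': 'error',
  },
};
```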

@rudolf rudolf added the Team:ResponseOps Label for the ResponseOps team (formerly the Cases and Alerting teams) label Sep 20, 2021
@elasticmachine
Contributor

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)

@rudolf
Contributor

rudolf commented Sep 20, 2021

@elastic/kibana-alerting-services Based on others' observations, this seems to happen after:

Observable Monitor: Hung Observable restarted after 33000ms of inactivity

It might just be coincidence though in that when ES is unavailable, task manager is the first to start failing.

@mikecote
Contributor

@elastic/kibana-alerting-services Based on others' observations, this seems to happen after:

Observable Monitor: Hung Observable restarted after 33000ms of inactivity

It might just be coincidence though in that when ES is unavailable, task manager is the first to start failing.

I think it's a coincidence because the task manager message happens whenever it detects that it didn't poll for a certain amount of time (33000ms in this case). Under the hood, I'm guessing the event loop resumes the function in task manager on wake, sees it's been some time since it polled, and logs the message. I believe I've also seen this before when putting my laptop to sleep for a while.

@gmmorris can you confirm the above statement?

@gmmorris
Contributor

Indeed.
We have an observable that emits an event every time it's supposed to query ES.
We have another observable that emits an event after 30s, and if the above-mentioned observable has hung for whatever reason (and has failed to emit an event in the past 30s), it restarts the observable.

It sounds like a symptom rather than the cause.
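
To make the mechanism concrete, here is a simplified sketch of the idea, not Task Manager's actual implementation (which is built on observables). `InactivityMonitor` and its methods are hypothetical names, and timestamps are passed in explicitly so the behaviour after a laptop sleep can be simulated.

```typescript
// Hypothetical watchdog: tracks when the poller last emitted and
// signals a restart once it has been silent for `timeoutMs` or longer.
class InactivityMonitor {
  private readonly timeoutMs: number;
  private lastEventAt: number;

  constructor(timeoutMs: number, startedAt: number) {
    this.timeoutMs = timeoutMs;
    this.lastEventAt = startedAt;
  }

  // Called every time the monitored poller emits an event.
  recordEvent(now: number): void {
    this.lastEventAt = now;
  }

  // Called periodically. If the poller has been idle too long, invokes
  // `restart` with the idle time, resets the timer, and returns true.
  checkAndRestart(now: number, restart: (idleMs: number) => void): boolean {
    const idleMs = now - this.lastEventAt;
    if (idleMs < this.timeoutMs) {
      return false;
    }
    restart(idleMs);
    this.lastEventAt = now;
    return true;
  }
}
```

With a 30000 ms timeout, a last event at t = 10000 and a check at t = 43000 (for instance after the machine wakes from sleep), the restart callback fires with an idle time of 33000 ms. That illustrates why the log message can appear without the poller itself being the cause of the crash.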

@kobelb kobelb added the needs-team Issues missing a team label label Jan 31, 2022
@botelastic botelastic bot removed the needs-team Issues missing a team label label Jan 31, 2022
@afharo afharo changed the title from "Kibana crashes with ECONNRESET when connection to remote cluster is lost" to "Kibana crashes with ECONNRESET when connection to remote cluster is lost due to unhandled promises" Jul 19, 2022
@afharo afharo self-assigned this Apr 26, 2024
@afharo
Member

afharo commented Apr 26, 2024

#181456 might finally address this

@afharo afharo linked a pull request Apr 30, 2024 that will close this issue
66 tasks
8 participants