Cluster load issues due to "Run beyond timeout" #73443

Closed
flash1293 opened this issue Jul 28, 2020 · 9 comments · Fixed by #73712
Labels
blocker bug Fixes for quality problems that affect the customer experience Feature:Search Querying infrastructure in Kibana v7.9.0

Comments

@flash1293
Contributor

Kibana version: 7.7 upwards

Describe the bug: The "Run beyond timeout" feature lets searches run indefinitely once the user clicks the button in the prompt. This can cause cluster load issues and in some cases even bring the cluster down, because extremely costly searches (running for multiple hours) can go unnoticed and continue to run in Elasticsearch if the user abandons the dashboard after a while.

Steps to reproduce:

  1. Go to a dashboard
  2. Start some really expensive search (e.g. spanning a lot of data or including very costly aggregations)
  3. Hit the "Run beyond timeout" button
  4. Close the browser window
  5. An hour later, the search task is still running in Elasticsearch

Expected behavior:
There are several options for improving the situation:

  • Put a very noticeable warning in the prompt about what will happen and how to recover from it
  • Pressing the button won't remove the timeout completely, but will set it to 10 minutes or something similar - then the user has to accept again (or maybe there is a dropdown in the prompt letting the user choose how long to run it beyond the timeout)
  • When the user navigates away from the dashboard, the task is aborted (probably not helpful for "running in background")
  • Kibana checks background tasks and warns about detached long running ones

This is probably not a bug, but it's easy to misuse the feature in practice.

@elasticmachine
Contributor

Pinging @elastic/kibana-app-arch (Team:AppArch)

@flash1293
Contributor Author

flash1293 commented Jul 28, 2020

cc @Dosant

@Dosant Dosant added the Feature:Search Querying infrastructure in Kibana label Jul 28, 2020
@lukasolson
Member

Have we verified that the search tasks are not cleaned up when the user navigates away or closes the browser? If so, that's a bug in Kibana for sure.

@flash1293
Contributor Author

@lukasolson I saw search tasks with multi-hour runtimes in production clusters with the description async_search.

@flash1293
Contributor Author

flash1293 commented Jul 28, 2020

@lukasolson I looked into this together with @Dosant and we were under the impression there was no logic in place to do that cleanup after the user has hit "Run beyond timeout". If there is a mechanism for that, we should verify whether it's actually caused by Kibana (e.g. because it doesn't work 100% of the time) or whether those users simply triggered those searches manually or via other integrations.

@lukasolson
Member

Okay, so I spent some time looking into this yesterday. There are a few scenarios where we would want to cancel async search requests:

  1. When a user navigates away (still within Kibana)
  2. When a user re-submits the query (or changes the query and re-fetches)
  3. When a user navigates outside of Kibana
  4. When a user closes the tab and/or browser

We are properly handling the first two cases, but not the last two. I was under the impression that the destroy lifecycle that the second case relies on would make the last two work, but when a user navigates outside of Kibana or closes the tab, that lifecycle doesn't fire.

After talking with @lizozom, there's a simple solution for this. We can send the keep_alive parameter in our requests to _async_search. We can set it to something like one minute. Each time we send another request, the TTL for the task in ES will be extended. If the user closes the browser or navigates outside of Kibana, no more requests will be fired, and after one minute, the task will be cancelled in ES.
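The keep_alive idea described above could be sketched roughly like this. It is a minimal illustration under stated assumptions, not Kibana's actual implementation: `buildAsyncSearchRequest`, `runAsyncSearch`, the `send` callback and the one-second poll interval are hypothetical names and values; only the `keep_alive` parameter and the `_async_search` endpoint come from the discussion. The search body itself is left out for brevity.

```typescript
// Sketch only: the helper names and the `send` callback are hypothetical.
// `keep_alive` and the `_async_search` endpoint are the parts taken from the discussion.

interface AsyncSearchRequest {
  method: 'POST' | 'GET';
  path: string;
  query: { keep_alive: string };
}

interface AsyncSearchStatus {
  id?: string;
  is_running: boolean;
}

// Assumed value, following "something like one minute" above.
const KEEP_ALIVE = '1m';

// Build either the initial submit request or a follow-up poll request.
// Both carry keep_alive, so every request pushes the expiry out again.
function buildAsyncSearchRequest(index: string, id?: string): AsyncSearchRequest {
  if (id === undefined) {
    return { method: 'POST', path: `/${index}/_async_search`, query: { keep_alive: KEEP_ALIVE } };
  }
  return {
    method: 'GET',
    path: `/_async_search/${encodeURIComponent(id)}`,
    query: { keep_alive: KEEP_ALIVE },
  };
}

// Poll until the search finishes. If the tab is closed, this loop simply stops
// running, no further requests are sent, and the search expires after KEEP_ALIVE.
async function runAsyncSearch(
  index: string,
  send: (request: AsyncSearchRequest) => Promise<AsyncSearchStatus>,
  pollIntervalMs = 1000
): Promise<AsyncSearchStatus> {
  let status = await send(buildAsyncSearchRequest(index));
  while (status.is_running && status.id !== undefined) {
    await new Promise<void>((resolve) => setTimeout(resolve, pollIntervalMs));
    status = await send(buildAsyncSearchRequest(index, status.id));
  }
  return status;
}
```

For this to work, the poll interval has to stay well below the keep_alive value; otherwise a search could expire between two polls even though the user is still waiting for it.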

@lukasolson lukasolson added 7.9.0 blocker bug Fixes for quality problems that affect the customer experience labels Jul 29, 2020
@lukasolson
Member

I wanted to add that this will not solve this issue entirely. Because there is no prioritization of search requests in Elasticsearch, large async search requests will run at the same priority as internal Kibana requests. As a result, there will still likely be lots of scenarios where large async search requests will take down Kibana. The solution for this would be something inside Elasticsearch. The currently proposed issue for this can be found here: elastic/elasticsearch#37867

@flash1293
Contributor Author

@lukasolson I've also seen cancel tasks running for the same amount of time as the orphaned searches. Maybe this means that even if the cancellation request is sent, it won't work every time. Might be worth looking into this as well:

"***": {
  "node": "***",
  "id": 96953495,
  "type": "transport",
  "action": "cluster:admin/tasks/cancel",
  "start_time": "2020-07-21T11:15:57.255Z",
  "start_time_in_millis": 1595330157255,
  "running_time": "9.9d",
  "running_time_in_nanos": 859004843218882,
  "cancellable": false,
  "headers": {}
},
"***": {
  "node": "***",
  "id": 96951318,
  "type": "transport",
  "action": "indices:data/read/search",
  "start_time": "2020-07-21T11:13:57.442Z",
  "start_time_in_millis": 1595330037442,
  "running_time": "9.9d",
  "running_time_in_nanos": 859506190190270,
  "cancellable": true,
  "parent_task_id": "***",
  "headers": {}
},
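
Entries like the two above could be picked out of a full task listing with something like the following. This is only a sketch: `findLongRunningTasks` is a hypothetical helper, and it assumes the listing has the usual task management API shape (`nodes`, each with a `tasks` map), of which the excerpt above shows two entries.

```typescript
// Sketch only: `findLongRunningTasks` is a hypothetical helper, and the
// response shape (nodes -> tasks) is assumed from the task management API.

interface TaskInfo {
  node: string;
  id: number;
  action: string;
  running_time_in_nanos: number;
  cancellable: boolean;
  parent_task_id?: string;
}

interface TasksResponse {
  nodes: { [nodeId: string]: { tasks: { [taskId: string]: TaskInfo } } };
}

interface LongRunningTask {
  taskId: string;
  action: string;
  hours: number;
  cancellable: boolean;
}

const NANOS_PER_HOUR = 60 * 60 * 1e9;

// Collect every task that has been running for at least `minHours`.
function findLongRunningTasks(response: TasksResponse, minHours: number): LongRunningTask[] {
  const found: LongRunningTask[] = [];
  for (const nodeId of Object.keys(response.nodes)) {
    const tasks = response.nodes[nodeId].tasks;
    for (const taskId of Object.keys(tasks)) {
      const task = tasks[taskId];
      const hours = task.running_time_in_nanos / NANOS_PER_HOUR;
      if (hours >= minHours) {
        found.push({ taskId, action: task.action, hours, cancellable: task.cancellable });
      }
    }
  }
  // Longest-running first.
  return found.sort((a, b) => b.hours - a.hours);
}
```

Filtering the result on `action` (for example `cluster:admin/tasks/cancel` versus `indices:data/read/search`) would separate stuck cancellations from the orphaned searches themselves.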

@tomcallahan

@jimczi @dnhatn I'm not sure exactly where this cancellation would fall, but can one of you say whether we have an issue in ES on the cancellation side, based on @lukasolson's post?


5 participants