Apply back pressure in Task Manager whenever Elasticsearch responds with a 429 #65553

Closed
mikecote opened this issue May 6, 2020 · 6 comments · Fixed by #75666
Labels: Feature:Task Manager, Team:ResponseOps

mikecote commented May 6, 2020

From @kobelb:

The 429 error can occur within task manager itself, or within an individual task

mikecote added the Feature:Task Manager and Team:ResponseOps labels May 6, 2020
@elasticmachine

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)


pmuellr commented Jul 21, 2020

Needs more context - curious where this quoted text from Brandon came from ...

I guess the idea is that if we get a 429 from ES, we need to stop making ES requests for some amount of time. The interval probably needs to be increased, which means it will be somewhat dynamic (probably only increasing from the configured value; I don't think we'd ever decrease the interval below it, at least right now).

Guessing also there will be some ES requests that take priority over others. Eg, fetching new jobs should be a lower priority than marking jobs complete.
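A rough sketch of what that prioritisation could look like, purely for illustration; the `PriorityBuffer` class and the two stubbed operations are hypothetical names, not Task Manager APIs:

```ts
// Hypothetical sketch: buffer ES operations with a priority so that, under
// back pressure, completing existing work wins over claiming new work.

interface EsOperation<T = unknown> {
  priority: number; // lower number = higher priority
  run: () => Promise<T>;
}

class PriorityBuffer {
  private queue: EsOperation[] = [];

  enqueue(op: EsOperation): void {
    this.queue.push(op);
    // Keep the most important operations at the front of the buffer.
    this.queue.sort((a, b) => a.priority - b.priority);
  }

  // Drain up to `limit` operations; the limit would shrink while ES is
  // answering with 429s and grow back once it stops.
  async drain(limit: number): Promise<void> {
    const batch = this.queue.splice(0, limit);
    await Promise.all(batch.map((op) => op.run()));
  }
}

// Stand-ins for real Task Manager operations.
const markTaskComplete = async (id: string) => { /* update the task doc */ };
const claimNewTasks = async () => { /* updateByQuery to claim tasks */ };

// Usage: marking a task complete outranks fetching new work.
const buffer = new PriorityBuffer();
buffer.enqueue({ priority: 0, run: () => markTaskComplete('task-1') });
buffer.enqueue({ priority: 1, run: () => claimNewTasks() });
void buffer.drain(1); // only the completion goes out this cycle
```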

gmmorris assigned and then unassigned gmmorris Jul 22, 2020
@mikecote

@pmuellr I'll send you the document the quote came from. The goal would be for Task Manager to reduce the stress it's putting on Elasticsearch whenever Elasticsearch responds with a 429. I think the ideas you have are good options to solve the problem.

mikecote self-assigned this Aug 5, 2020
mikecote mentioned this issue Aug 11, 2020

mikecote commented Aug 14, 2020

I've done some thinking / research on this issue and below are my remarks.

The goal is to apply back pressure so the system is more resilient when users tune the task manager configuration for better throughput. If Elasticsearch returns 429 errors, the system should apply back pressure automatically.

With that goal in mind, here are the task manager settings a user can change that may cause Elasticsearch to return 429 errors more frequently (an example configuration follows this list):

  • xpack.task_manager.poll_interval by decreasing the value
  • xpack.task_manager.request_capacity by increasing the value, allowing more ad-hoc "run now" requests
  • xpack.task_manager.max_workers by increasing the value
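For reference, these settings live in kibana.yml. The values below are only illustrative (roughly the defaults at the time of writing), shown to make clear which knobs the back-pressure logic would be protecting against:

```yaml
# Illustrative values only; tuning these for throughput is what can push
# Elasticsearch into responding with 429s.
xpack.task_manager.poll_interval: 3000     # ms between task-claiming polls; lower means more ES traffic
xpack.task_manager.request_capacity: 1000  # how many ad-hoc "run now" requests can queue up
xpack.task_manager.max_workers: 10         # how many tasks run concurrently per Kibana instance
```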

By adding back pressure, the task manager will have fewer interactions with Elasticsearch. Tasks that interact with Elasticsearch themselves will also benefit from this and run more successfully.

There are two main thread pools to keep an eye on:

  1. Write: Task manager uses this heavily to update, schedule, remove, and claim tasks
  2. Search: Task manager uses this during the task-claiming process, and some tasks themselves run search queries

Proposal

To keep it simple, I think it would be best to reduce max_workers and increase poll_interval by a percentage at a set interval until 429 errors are no longer happening. Once the system sees no errors, we could start moving max_workers and poll_interval back toward their configured values by a percentage at the same interval, until errors happen again or the configured values are reached.

The value I see in reducing max_workers is that it reduces the thundering herd of requests that the underlying tasks can cause right after the claiming process finishes.

The value I see in backing off the poll_interval is that it keeps the task manager from triggering 429 errors on its own when the administrator configures a very low value.

This proposal could also stop the process of claiming more tasks than task manager executes (#65552).

An example of percentages and intervals could be: reduce the values by 20% every 10 seconds until 429 errors stop, then move them back toward the configured values by 5% every 10 seconds until 429 errors happen again or the configured values are reached. The system could then cycle between the two states or remain in a reduced configuration for a longer period of time.
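To make the cycle concrete, here is a minimal sketch of that back-off and recovery loop, assuming the 20% / 5% / 10-second numbers from the example above; the `BackPressureScaler` name and wiring are hypothetical, not the eventual implementation:

```ts
// Hypothetical sketch of the proposed scale-down / scale-back-up cycle.

interface ScaledConfig {
  maxWorkers: number;
  pollInterval: number; // ms
}

class BackPressureScaler {
  private current: ScaledConfig;

  constructor(private readonly configured: ScaledConfig) {
    this.current = { ...configured };
  }

  // Called on a timer (e.g. every 10 seconds) with whether any 429s were
  // observed since the last tick.
  tick(saw429: boolean): ScaledConfig {
    if (saw429) {
      // Back off: fewer workers, slower polling.
      this.current = {
        maxWorkers: Math.max(1, Math.floor(this.current.maxWorkers * 0.8)),
        pollInterval: Math.ceil(this.current.pollInterval * 1.2),
      };
    } else {
      // Recover gradually, never going past the configured values.
      this.current = {
        maxWorkers: Math.min(this.configured.maxWorkers, Math.ceil(this.current.maxWorkers * 1.05)),
        pollInterval: Math.max(this.configured.pollInterval, Math.floor(this.current.pollInterval * 0.95)),
      };
    }
    return this.current;
  }
}

// Usage with illustrative configured values.
const scaler = new BackPressureScaler({ maxWorkers: 10, pollInterval: 3000 });
console.log(scaler.tick(true));  // { maxWorkers: 8, pollInterval: 3600 }
console.log(scaler.tick(false)); // { maxWorkers: 9, pollInterval: 3420 }, moving back toward configured
```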

Alternatives

There are some alternatives that could help find the right configuration for poll_interval and max_workers, but they add complexity as well.

  • We could keep an in-memory history, per interval, of the successes vs. failures that the task manager sees in its requests. When it detects failures, it could calculate the failure percentage and reduce the values by a proportional amount so it gets back to a non-erroring state more quickly. The same could apply when restoring the values afterwards. (A sketch of this follows the list.)

  • We could poll the Elasticsearch thread pool stats API to see how full the thread pools are, but this would add more load on Elasticsearch and introduce asynchronous dependencies.

  • We could lower the polling frequency, claim more tasks than needed per poll (Eliminate the downtime between tasks completing and the next polling interval, #65552), and reduce the concurrency of tasks in progress. This would let the task manager talk to Elasticsearch less frequently to get the same amount of work done. This could be used in combination with the proposal above.
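As mentioned in the first alternative, a sketch of the in-memory success/failure tracking might look like the following; the class and the 50-request window size are hypothetical:

```ts
// Hypothetical sketch: track recent request outcomes and derive how hard to back off.

class ErrorRateWindow {
  private outcomes: boolean[] = [];

  constructor(private readonly size = 50) {}

  record(succeeded: boolean): void {
    this.outcomes.push(succeeded);
    if (this.outcomes.length > this.size) {
      this.outcomes.shift(); // drop the oldest outcome
    }
  }

  // Fraction of recent requests that failed (e.g. with a 429); 0 when empty.
  failureRate(): number {
    if (this.outcomes.length === 0) return 0;
    return this.outcomes.filter((ok) => !ok).length / this.outcomes.length;
  }
}

// Usage: scale the reduction by the observed failure rate instead of a flat 20%.
const errorWindow = new ErrorRateWindow();
errorWindow.record(true);
errorWindow.record(false); // a 429
const reduction = Math.min(0.5, errorWindow.failureRate()); // cap the cut at 50%
console.log(`reduce max_workers and polling by ${reduction * 100}%`); // 50% here
```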


gmmorris commented Aug 17, 2020

I like that idea.
I think it's far more elegant than the original idea we discussed of slowing down the poller directly.

The tricky part will be how to model it reactively so that the poller takes the new poll_interval into account.
max_workers is checked all the time, but poll_interval initialises an Rx interval once and isn't looked at again... I'm not sure how best to model that, but it's definitely doable.
👍
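One possible way to model it, sketched here with rxjs (the variable names are hypothetical, not the actual Task Manager code): push poll_interval changes through a BehaviorSubject and switchMap onto a fresh interval whenever the value changes.

```ts
import { BehaviorSubject, interval } from 'rxjs';
import { switchMap } from 'rxjs/operators';

// The current polling period; back-pressure logic can push new values into it.
const pollInterval$ = new BehaviorSubject<number>(3000);

const poll$ = pollInterval$.pipe(
  // Each time a new period is emitted, tear down the old timer and start a
  // new one, so the poller always honours the latest poll_interval.
  switchMap((ms) => interval(ms))
);

poll$.subscribe(() => {
  // claim and run tasks here
});

// Elsewhere, when ES starts returning 429s:
pollInterval$.next(3600);
```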


pmuellr commented Aug 17, 2020

Ya, this seems like a reasonable approach.
