Apply back pressure in Task Manager whenever Elasticsearch responds with a 429 #65553

Closed
mikecote opened this issue May 6, 2020 · 6 comments · Fixed by #75666
Labels: Feature:Task Manager, Team:ResponseOps

mikecote commented May 6, 2020

From @kobelb:

The 429 error can occur within task manager itself, or within an individual task

mikecote added the Feature:Task Manager and Team:ResponseOps labels May 6, 2020
@elasticmachine

Pinging @elastic/kibana-alerting-services (Team:Alerting Services)


pmuellr commented Jul 21, 2020

Needs more context - curious where this quoted text from Brandon came from ...

I guess the idea is that if we get a 429 from ES, we need to stop making ES requests for some amount of time. The interval probably needs to be increased, which means it will be somewhat dynamic (probably only increasing from the configured value; I don't think we'd ever decrease the interval below it, at least right now).

Guessing also there will be some ES requests that take priority over others. Eg, fetching new jobs should be a lower priority than marking jobs complete.
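A rough sketch of what that prioritisation could look like, purely for illustration; the `PriorityBuffer` class and the two stubbed operations are hypothetical names, not Task Manager APIs:

```ts
// Hypothetical sketch: buffer ES operations with a priority so that, under
// back pressure, completing existing work wins over claiming new work.

interface EsOperation<T = unknown> {
  priority: number; // lower number = higher priority
  run: () => Promise<T>;
}

class PriorityBuffer {
  private queue: EsOperation[] = [];

  enqueue(op: EsOperation): void {
    this.queue.push(op);
    // Keep the most important operations at the front of the buffer.
    this.queue.sort((a, b) => a.priority - b.priority);
  }

  // Drain up to `limit` operations; the limit would shrink while ES is
  // answering with 429s and grow back once it stops.
  async drain(limit: number): Promise<void> {
    const batch = this.queue.splice(0, limit);
    await Promise.all(batch.map((op) => op.run()));
  }
}

// Stand-ins for real Task Manager operations.
const markTaskComplete = async (id: string) => { /* update the task doc */ };
const claimNewTasks = async () => { /* updateByQuery to claim tasks */ };

// Usage: marking a task complete outranks fetching new work.
const buffer = new PriorityBuffer();
buffer.enqueue({ priority: 0, run: () => markTaskComplete('task-1') });
buffer.enqueue({ priority: 1, run: () => claimNewTasks() });
void buffer.drain(1); // only the completion goes out this cycle
```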

gmmorris assigned and then unassigned gmmorris Jul 22, 2020
@mikecote

@pmuellr I'll send you the document the quote came from. The goal would be for Task Manager to reduce the stress it's putting on Elasticsearch whenever Elasticsearch responds with a 429. I think the ideas you have are good options to solve the problem.

mikecote self-assigned this Aug 5, 2020
mikecote mentioned this issue Aug 11, 2020

mikecote commented Aug 14, 2020

I've done some thinking / research on this issue and below are my remarks.

The goal is to apply back pressure so the system is more resilient when users tune the task manager configuration for better throughput. If Elasticsearch returns 429 errors, the system should apply back pressure automatically.

With that goal in mind, here are the task manager settings a user can change that may cause Elasticsearch to return 429 errors more frequently (an example configuration follows this list):

  • xpack.task_manager.poll_interval by decreasing the value
  • xpack.task_manager.request_capacity by increasing the value, allowing more ad-hoc "run now" requests
  • xpack.task_manager.max_workers by increasing the value
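For reference, these settings live in kibana.yml. The values below are only illustrative (roughly the defaults at the time of writing), shown to make clear which knobs the back-pressure logic would be protecting against:

```yaml
# Illustrative values only; tuning these for throughput is what can push
# Elasticsearch into responding with 429s.
xpack.task_manager.poll_interval: 3000     # ms between task-claiming polls; lower means more ES traffic
xpack.task_manager.request_capacity: 1000  # how many ad-hoc "run now" requests can queue up
xpack.task_manager.max_workers: 10         # how many tasks run concurrently per Kibana instance
```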

By adding back pressure, the task manager will have fewer interactions with Elasticsearch. Tasks that interact with Elasticsearch themselves will also benefit from this and run more successfully.

There are two main thread pools to keep an eye on:

  1. Write: Task manager uses this heavily to update, schedule, remove, and claim tasks
  2. Search: Task manager uses this during the task-claiming process, and some tasks themselves run search queries

Proposal

To keep it simple, I think it would be best to reduce max_workers and increase poll_interval by a percentage at a set interval until 429 errors are no longer happening. Once the system sees no errors, we could start moving max_workers and poll_interval back toward their configured values by a percentage at the same interval, until errors happen again or the configured values are reached.

The value I see in reducing max_workers is that it reduces the thundering herd of requests that the underlying tasks can cause right after the claiming process finishes.

The value I see in backing off the poll_interval is that it keeps the task manager from triggering 429 errors on its own when the administrator configures a very low value.

This proposal could also stop the process of claiming more tasks than task manager executes (#65552).

An example of percentages and intervals could be: reduce the values by 20% every 10 seconds until 429 errors stop, then move them back toward the configured values by 5% every 10 seconds until 429 errors happen again or the configured values are reached. The system could then cycle between the two states or remain in a reduced configuration for a longer period of time.
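To make the cycle concrete, here is a minimal sketch of that back-off and recovery loop, assuming the 20% / 5% / 10-second numbers from the example above; the `BackPressureScaler` name and wiring are hypothetical, not the eventual implementation:

```ts
// Hypothetical sketch of the proposed scale-down / scale-back-up cycle.

interface ScaledConfig {
  maxWorkers: number;
  pollInterval: number; // ms
}

class BackPressureScaler {
  private current: ScaledConfig;

  constructor(private readonly configured: ScaledConfig) {
    this.current = { ...configured };
  }

  // Called on a timer (e.g. every 10 seconds) with whether any 429s were
  // observed since the last tick.
  tick(saw429: boolean): ScaledConfig {
    if (saw429) {
      // Back off: fewer workers, slower polling.
      this.current = {
        maxWorkers: Math.max(1, Math.floor(this.current.maxWorkers * 0.8)),
        pollInterval: Math.ceil(this.current.pollInterval * 1.2),
      };
    } else {
      // Recover gradually, never going past the configured values.
      this.current = {
        maxWorkers: Math.min(this.configured.maxWorkers, Math.ceil(this.current.maxWorkers * 1.05)),
        pollInterval: Math.max(this.configured.pollInterval, Math.floor(this.current.pollInterval * 0.95)),
      };
    }
    return this.current;
  }
}

// Usage with illustrative configured values.
const scaler = new BackPressureScaler({ maxWorkers: 10, pollInterval: 3000 });
console.log(scaler.tick(true));  // { maxWorkers: 8, pollInterval: 3600 }
console.log(scaler.tick(false)); // { maxWorkers: 9, pollInterval: 3420 }, moving back toward configured
```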

Alternatives

There are some alternatives that could help find the right configuration for poll_interval and max_workers, but they add complexity as well.

  • We could keep an in-memory history, per interval, of the successes vs. failures that the task manager sees in its requests. When it detects failures, it could calculate the failure percentage and reduce the values by a proportional amount so it gets back to a non-erroring state more quickly. The same could apply when restoring the values afterwards. (A sketch of this follows the list.)

  • We could poll the Elasticsearch thread pool stats API to see how full the thread pools are, but this would add more load on Elasticsearch and introduce asynchronous dependencies.

  • We could lower the polling frequency, claim more tasks than needed per poll (Eliminate the downtime between tasks completing and the next polling interval, #65552), and reduce the concurrency of tasks in progress. This would let the task manager talk to Elasticsearch less frequently to get the same amount of work done. This could be used in combination with the proposal above.
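As mentioned in the first alternative, a sketch of the in-memory success/failure tracking might look like the following; the class and the 50-request window size are hypothetical:

```ts
// Hypothetical sketch: track recent request outcomes and derive how hard to back off.

class ErrorRateWindow {
  private outcomes: boolean[] = [];

  constructor(private readonly size = 50) {}

  record(succeeded: boolean): void {
    this.outcomes.push(succeeded);
    if (this.outcomes.length > this.size) {
      this.outcomes.shift(); // drop the oldest outcome
    }
  }

  // Fraction of recent requests that failed (e.g. with a 429); 0 when empty.
  failureRate(): number {
    if (this.outcomes.length === 0) return 0;
    return this.outcomes.filter((ok) => !ok).length / this.outcomes.length;
  }
}

// Usage: scale the reduction by the observed failure rate instead of a flat 20%.
const errorWindow = new ErrorRateWindow();
errorWindow.record(true);
errorWindow.record(false); // a 429
const reduction = Math.min(0.5, errorWindow.failureRate()); // cap the cut at 50%
console.log(`reduce max_workers and polling by ${reduction * 100}%`); // 50% here
```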


gmmorris commented Aug 17, 2020

I like that idea.
I think it's far more elegant than the original idea we discussed of slowing down the poller directly.

The tricky part will be how to model it reactively so that the poller takes the new poll_interval into account.
max_workers is checked all the time, but poll_interval initialises an Rx interval once and isn't looked at again... I'm not sure how best to model that, but it's definitely doable.
👍
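One possible way to model it, sketched here with rxjs (the variable names are hypothetical, not the actual Task Manager code): push poll_interval changes through a BehaviorSubject and switchMap onto a fresh interval whenever the value changes.

```ts
import { BehaviorSubject, interval } from 'rxjs';
import { switchMap } from 'rxjs/operators';

// The current polling period; back-pressure logic can push new values into it.
const pollInterval$ = new BehaviorSubject<number>(3000);

const poll$ = pollInterval$.pipe(
  // Each time a new period is emitted, tear down the old timer and start a
  // new one, so the poller always honours the latest poll_interval.
  switchMap((ms) => interval(ms))
);

poll$.subscribe(() => {
  // claim and run tasks here
});

// Elsewhere, when ES starts returning 429s:
pollInterval$.next(3600);
```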


pmuellr commented Aug 17, 2020

Ya, this seems like a reasonable approach.
