Apply back pressure in Task Manager whenever Elasticsearch responds with a 429 #65553
Pinging @elastic/kibana-alerting-services (Team:Alerting Services)
Needs more context - curious where this quoted text from Brandon came from ... I guess the idea is: if we get a 429 from ES, we need to stop making ES requests for some amount of time. Probably the interval needs to be increased, which means it will be somewhat dynamic (probably just increasing from the config'd value; I don't think we'd ever decrease the interval, right now anyway). Guessing also that some ES requests will take priority over others. E.g., fetching new jobs should be a lower priority than marking jobs complete.
@pmuellr I'll send you the document the quote came from. The goal would be for Task Manager to reduce the stress it's putting on Elasticsearch whenever Elasticsearch responds with a 429. I think the ideas you have are good options to solve the problem.
I've done some thinking and research on this issue; below are my remarks. The goal is to add back pressure so the system stays resilient when users tune the task manager configuration for better throughput. If Elasticsearch returns 429 errors, it would be nice for the system to apply back pressure automatically. With that goal in mind, the task manager settings a user can change that may cause Elasticsearch to return 429 errors more frequently are max_workers and poll_interval (discussed in the proposal below).
By adding back pressure, the task manager will have fewer interactions with Elasticsearch. Tasks that interact with Elasticsearch themselves will also benefit from this and run more successfully. There are two main Elasticsearch thread pools to look out for.
Proposal

To keep it simple, I'm thinking it would be best to reduce max_workers and increase poll_interval by a percentage at a certain interval until 429 errors are no longer happening. Once the system sees no errors, we could then start moving max_workers and poll_interval back toward the configured values by a percentage at the same interval, until errors happen again or we've reached the configured values. The value I see in reducing max_workers is that it reduces the thundering herd the underlying tasks can cause right after the claiming process finishes. The value I see in increasing poll_interval is that it keeps the task manager from causing 429 errors itself when an administrator configures a very low number. This proposal could also stop the process of claiming more tasks than the task manager executes (#65552). As an example of percentages and intervals, we could back off by 20% every 10 seconds until there are no longer 429 errors, and then start recovering by 5% every 10 seconds until 429 errors happen again or we've reached the configured values. The system could then cycle between the two, or remain in a reduced configuration for a longer period of time (a rough sketch of this loop follows this comment).

Alternatives

There are some alternatives that could help find the right configuration for poll_interval and max_workers, but they add complexity as well.
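A rough sketch of the adjustment loop described in the proposal above, using the 20%/5% steps and the 10-second cadence from the example; the names (BackPressureController, notify429, adjust) and the wiring at the bottom are hypothetical, not existing Task Manager APIs:

```ts
// Hypothetical sketch of the proposed back-pressure loop. The step sizes and
// the 10-second cadence mirror the example in the comment above.
interface TaskManagerKnobs {
  maxWorkers: number;
  pollInterval: number; // milliseconds
}

class BackPressureController {
  private current: TaskManagerKnobs;
  private recent429 = false;

  constructor(
    private readonly configured: TaskManagerKnobs,
    private readonly reduceBy = 0.2, // back off by 20% while 429s occur
    private readonly recoverBy = 0.05 // recover by 5% once they stop
  ) {
    this.current = { ...configured };
  }

  // Called by whatever code observes a 429 from Elasticsearch.
  notify429(): void {
    this.recent429 = true;
  }

  // Called on a fixed cadence (e.g. every 10 seconds).
  adjust(): TaskManagerKnobs {
    if (this.recent429) {
      // Apply back pressure: fewer workers, slower polling.
      this.current.maxWorkers = Math.max(1, Math.floor(this.current.maxWorkers * (1 - this.reduceBy)));
      this.current.pollInterval = Math.ceil(this.current.pollInterval * (1 + this.reduceBy));
    } else {
      // Recover toward the configured values, never past them.
      this.current.maxWorkers = Math.min(
        this.configured.maxWorkers,
        Math.ceil(this.current.maxWorkers * (1 + this.recoverBy))
      );
      this.current.pollInterval = Math.max(
        this.configured.pollInterval,
        Math.floor(this.current.pollInterval * (1 - this.recoverBy))
      );
    }
    this.recent429 = false; // reset the observation window for the next interval
    return { ...this.current };
  }
}

// Example wiring (illustrative only): re-evaluate every 10 seconds.
const controller = new BackPressureController({ maxWorkers: 10, pollInterval: 3000 });
setInterval(() => {
  const { maxWorkers, pollInterval } = controller.adjust();
  // taskPoller.setPollInterval(pollInterval); // hypothetical hooks into Task Manager
  // taskPool.setMaxWorkers(maxWorkers);
}, 10_000);
```

The one-flag-per-window reset is just one way to express the "reduce until errors stop, then recover" cycle described above; a real implementation would also need to decide how long to stay reduced before attempting recovery.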
I like that idea. The tricky part will be how to model it reactively so that the poller takes the new polling_interval into account.
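A minimal sketch of one way to model that with RxJS, assuming the poller is driven by a timer observable; pollForWork and pollInterval$ are hypothetical names, not Task Manager's actual internals:

```ts
import { BehaviorSubject, timer } from 'rxjs';
import { switchMap } from 'rxjs/operators';

// Placeholder for whatever claims and runs tasks.
async function pollForWork(): Promise<void> {
  // ...claim tasks, run them, mark them complete...
}

// Current poll interval in milliseconds; back-pressure logic pushes new values into it.
const pollInterval$ = new BehaviorSubject<number>(3000);

// switchMap tears down the previous timer and starts a new one whenever the
// interval changes, so the poller picks up an adjusted polling_interval immediately.
const subscription = pollInterval$
  .pipe(switchMap((intervalMs) => timer(0, intervalMs)))
  .subscribe(() => {
    void pollForWork();
  });

// Elsewhere, when a 429 is observed, slow the poller down by e.g. 20%:
// pollInterval$.next(Math.ceil(pollInterval$.getValue() * 1.2));
```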
Ya, this seems like a reasonable approach.
From @kobelb: