[RFC] Admission Control mechanism for Cluster Manager APIs #7520

Closed
shwetathareja opened this issue May 11, 2023 · 2 comments · Fixed by #12496
Labels
discuss Issues intended to help drive brainstorming and decision making distributed framework enhancement Enhancement or improvement to existing feature or request idea Things we're kicking around. RFC Issues requesting major changes v2.13.0 Issues and PRs related to version 2.13.0

Comments

@shwetathareja
Member

shwetathareja commented May 11, 2023

Is your feature request related to a problem? Please describe.
Today, the Cluster Manager can be overwhelmed by too many requests, which can spike its memory/CPU usage and keep its transport layer busy. This can have unwanted effects on the cluster, with critical operations like health checks failing and node-join/node-left processing getting delayed. There are circuit breakers that operate on heap memory usage and start rejecting requests once a threshold is breached. However, they can still admit a large number of incoming requests, because they account for the incoming request size, which is 0 for most GET requests. Also, APIs like _cluster/health and _cluster/state, which are critical for cluster functioning, are not tripped by the breakers, yet their response payloads can be really big, potentially in MBs. The circuit breakers also don’t handle any prioritization.
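
To illustrate the gap described above, here is a minimal Python sketch of a breaker that admits requests purely on accounted request size. The `SizeBasedBreaker` class, its limit, and the byte counts are hypothetical and are not OpenSearch's circuit breaker implementation; the only point is that a request accounted as 0 bytes never moves the estimate, so any number of them gets through.

```python
class SizeBasedBreaker:
    """Toy breaker (hypothetical): rejects once summed request bytes would exceed a limit."""

    def __init__(self, limit_bytes: int) -> None:
        self.limit_bytes = limit_bytes
        self.used_bytes = 0

    def try_admit(self, request_bytes: int) -> bool:
        # Admission depends only on the size of the incoming request.
        if self.used_bytes + request_bytes > self.limit_bytes:
            return False
        self.used_bytes += request_bytes
        return True


breaker = SizeBasedBreaker(limit_bytes=1024)

# 10,000 GET-style requests with an empty body: each is accounted as 0 bytes,
# so none of them brings the breaker any closer to its limit.
admitted_gets = sum(breaker.try_admit(0) for _ in range(10_000))

# A single oversized request is the only thing this kind of breaker reacts to.
large_request_admitted = breaker.try_admit(2048)
```

In this sketch every one of the empty-body requests is admitted, regardless of how expensive it would be to build and send its response.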

OpenSearch already supports Indexing and Search Back Pressure with intelligent resource tracking. The proposal is to build smart admission control for Cluster Manager APIs (eventually back pressure).

Describe the solution you'd like
Cluster Manager availability is critical to the overall availability and stability of the cluster. The proposal here is to provide a more intelligent request rejection mechanism that takes into account the pending requests in the transport thread pool queue, considers other resources like CPU, and handles prioritisation during rejection.
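
One way to picture such a mechanism is the Python sketch below. It is only an illustration of the idea under stated assumptions, not a proposed implementation: the `Priority` classes, the `ResourceSnapshot` fields, the limits, and the per-priority threshold fractions are all hypothetical. Lower-priority requests are rejected at a lower fraction of the CPU and queue limits, so critical operations keep headroom.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    # Hypothetical priority classes; a higher value means more critical.
    LOW = 0       # e.g. _cat or _nodes/stats style admin reads
    NORMAL = 1
    CRITICAL = 2  # e.g. health checks, node-join/node-left processing


# Assumed fraction of each limit at which a priority class starts being rejected.
THRESHOLD_FRACTION = {
    Priority.LOW: 0.70,
    Priority.NORMAL: 0.85,
    Priority.CRITICAL: 1.00,
}


@dataclass
class ResourceSnapshot:
    cpu_percent: float  # recent CPU utilisation on the cluster manager node
    queue_size: int     # pending requests in the transport thread pool queue


@dataclass
class AdmissionController:
    cpu_limit: float = 90.0   # illustrative values, not real defaults
    queue_limit: int = 1000

    def should_admit(self, priority: Priority, snapshot: ResourceSnapshot) -> bool:
        fraction = THRESHOLD_FRACTION[priority]
        cpu_ok = snapshot.cpu_percent < self.cpu_limit * fraction
        queue_ok = snapshot.queue_size < self.queue_limit * fraction
        # Reject if either tracked resource is over this priority's threshold.
        return cpu_ok and queue_ok


controller = AdmissionController()
busy = ResourceSnapshot(cpu_percent=80.0, queue_size=100)
idle = ResourceSnapshot(cpu_percent=20.0, queue_size=10)

busy_decisions = {p.name: controller.should_admit(p, busy) for p in Priority}
idle_decisions = {p.name: controller.should_admit(p, idle) for p in Priority}
```

With the busy snapshot, only the `CRITICAL` class is still admitted; with the idle snapshot, everything is. A real design would also need to decide how resource usage is sampled and how requests are mapped to priorities, which this sketch leaves open.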

For write APIs, there is ClusterManager Task Throttling, which should provide protection against too many tasks, but a few tasks that spike resource usage could still cause impact. In the first phase, though, the plan is to focus on read APIs only.

In the future, there should also be a mechanism to cancel long-running read requests for admin operations like _cat, _nodes/stats, and _cluster APIs.


I am looking for feedback from the community to evolve this feature from an idea into a concrete proposal.

@shwetathareja shwetathareja added enhancement Enhancement or improvement to existing feature or request untriaged idea Things we're kicking around. labels May 11, 2023
@anasalkouz anasalkouz added discuss Issues intended to help drive brainstorming and decision making RFC Issues requesting major changes and removed untriaged labels May 31, 2023
@shwetathareja shwetathareja changed the title [RFC] Back Pressure mechanism for Cluster Manager APIs [RFC] Admission Control / Back Pressure mechanism for Cluster Manager APIs Jun 14, 2023
@shwetathareja shwetathareja changed the title [RFC] Admission Control / Back Pressure mechanism for Cluster Manager APIs [RFC] Admission Control mechanism for Cluster Manager APIs Jun 14, 2023
@bbarani
Member

bbarani commented Feb 6, 2024

@shwetathareja can you please confirm whether this change can be included in 2.x without breaking the existing API? Basically, can this change be added in a backward-compatible manner in the 2.x line?

We are evaluating whether this change requires the 3.0 release or can be included in the 2.x line, so we need your inputs.

@shwetathareja
Member Author

@bbarani this feature will be controlled using admission control settings and thresholds and can be done in a backward-compatible manner in 2.x. We will not enable it by default in 2.x, to prevent any breaking change for users, and will enable it by default once we have 3.0.

Projects
Status: ✅ Done

5 participants