Support for decommissioning and recommissioning a zone #3402

Open
Bukhtawar opened this issue May 19, 2022 · 1 comment
Labels
discuss Issues intended to help drive brainstorming and decision making enhancement Enhancement or improvement to existing feature or request

Comments

@Bukhtawar
Collaborator

Bukhtawar commented May 19, 2022

Is your feature request related to a problem? Please describe.
There are use cases where decommissioning a zone/rack is beneficial:

  1. In a multi-zone deployment, it can be useful to deploy zone by zone rather than through a rolling restart of each node, which can take too long on a large cluster. In that case the zone serves as the unit of deployment.
  2. An andon cord during zonal outages is another handy mechanism: it enables a graceful traffic shutdown in the impacted zone/rack, especially when some nodes are still operating in a degraded state.

Background
With #2859 we intend to weigh away shard search traffic. However, since OpenSearch follows a synchronous replication model, replication requests can get stuck because of any impairment in the write path. The current health check mechanisms for detecting and remediating a bad node are only a best-effort strategy and do not cover deeper health checks across all network paths. For predictability, we propose pulling an andon cord to cut off inter-zone replication traffic, which can be achieved by decommissioning the nodes in the impacted zone.

Implications
When a zone is decommissioned, all shards that were taking write traffic might be failed so that data consistency semantics are honoured and stale shards are marked unavailable. To make sure no in-flight requests fail, we need to weigh away shard search traffic as part of #2859. In setups without dedicated coordinator nodes, we need to ensure that no HTTP traffic is being sent and that all traffic is drained before the decommission API is triggered.
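The ordering implied above (weigh away search traffic, drain, then decommission) could be sketched roughly as follows. This is only an illustration: `decommission_gracefully` and the three callables passed to it are hypothetical names, not existing OpenSearch APIs, and a real drain loop would wait between polls rather than poll back to back.

```python
from typing import Callable, List


def decommission_gracefully(
    zone: str,
    set_search_weight: Callable[[str, float], None],  # hypothetical hook for #2859
    in_flight_requests: Callable[[str], int],         # hypothetical drain probe
    decommission: Callable[[str], None],              # hypothetical decommission call
    max_polls: int = 10,
) -> List[str]:
    """Run the three steps in order and return the names of the completed steps."""
    steps: List[str] = []

    # Step 1: stop routing search traffic to the zone.
    set_search_weight(zone, 0.0)
    steps.append("weighed_away")

    # Step 2: wait until nothing is in flight; give up after max_polls probes.
    for _ in range(max_polls):
        if in_flight_requests(zone) == 0:
            break
    else:
        raise TimeoutError(f"zone {zone!r} did not drain after {max_polls} polls")
    steps.append("drained")

    # Step 3: only now trigger the decommission.
    decommission(zone)
    steps.append("decommissioned")
    return steps
```

If the zone never drains, the sketch raises before the decommission call is made, so a zone with traffic still in flight is never decommissioned by this path.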

Describe the solution you'd like
A graceful mechanism to:

  1. Decommission a zone/rack
  2. Recommission a zone/rack

POST /_cluster/decommission
{
  "awareness_attribute": {"zone": "A-0"}
}

DELETE /_cluster/decommission
{
  "awareness_attribute": {"zone": "A-0"}
}
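As an illustration of the proposed request shape, here is a minimal Python sketch that builds (but does not send) the two requests. The endpoint and body come from the proposal above and do not necessarily exist in any released version; `DecommissionRequest` and `build_request` are hypothetical helper names.

```python
import json
from typing import NamedTuple


class DecommissionRequest(NamedTuple):
    method: str
    path: str
    body: str


def build_request(attribute: str, value: str, recommission: bool = False) -> DecommissionRequest:
    """Build the proposed decommission (POST) or recommission (DELETE) request."""
    if not attribute or not value:
        raise ValueError("awareness attribute name and value must be non-empty")
    # Per the proposal: POST decommissions, DELETE recommissions.
    method = "DELETE" if recommission else "POST"
    body = json.dumps({"awareness_attribute": {attribute: value}})
    return DecommissionRequest(method, "/_cluster/decommission", body)


if __name__ == "__main__":
    req = build_request("zone", "A-0")
    print(req.method, req.path)
    print(req.body)
```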

Constraints

  1. The attribute value should be one of the values in union(forces_zone, discovered_zone)
  2. At most one zone should be under decommission or recommission at any time
  3. The shard request weight for the decommissioned zone, from "[Feature] Support for weighted zonal search request routing policy" (#2859), should be set to zero
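A rough sketch of how these three constraints might be checked before accepting a request. `validate_decommission` and all of its parameters are hypothetical. The sketch also assumes two things the issue does not specify: that the zero weight is a precondition rather than a side effect of the decommission, and that a repeated request for the zone already in transition is acceptable.

```python
from typing import Dict, Optional, Set


def validate_decommission(
    zone: str,
    forced_zones: Set[str],
    discovered_zones: Set[str],
    active_zone: Optional[str],
    search_weights: Dict[str, float],
) -> None:
    """Raise ValueError if a decommission request for `zone` breaks a constraint."""
    # Constraint 1: the value must be a known zone, forced or discovered.
    if zone not in forced_zones | discovered_zones:
        raise ValueError(f"unknown zone: {zone!r}")

    # Constraint 2: only one zone may be in transition at a time.
    # Assumption: a repeat request for the same zone is allowed.
    if active_zone is not None and active_zone != zone:
        raise ValueError(f"zone {active_zone!r} is already in transition")

    # Constraint 3: the search weight for the zone must be explicitly zero.
    # Assumption: a missing weight counts as non-zero.
    if search_weights.get(zone) != 0:
        raise ValueError(f"search weight for {zone!r} is not zero")
```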


@Bukhtawar Bukhtawar added enhancement Enhancement or improvement to existing feature or request untriaged labels May 19, 2022
@adnapibar
Contributor

@Bukhtawar Can you provide more details?

@Bukhtawar Bukhtawar added discuss Issues intended to help drive brainstorming and decision making and removed untriaged labels May 24, 2022