Support for decommissioning and recommissioning a zone #3402

Bukhtawar · 2022-05-19T10:07:14Z

Is your feature request related to a problem? Please describe.
There are use cases where decommissioning a zone/rack is beneficial:

In a multi-zone deployment, it can be useful to deploy zone by zone rather than through a rolling restart per node, which can be too slow for a big cluster. In that case the zone serves as the unit of deployment.
An andon cord during zonal outages is another handy mechanism: it enables graceful traffic shutdown in the impacted zone/rack, especially when some nodes are still operating in a degraded manner.

Background
With #2859 we intend to weigh away shard search traffic. However, since OpenSearch follows a synchronous replication model, replication requests can get stuck due to any impairment in the write path. The current health check mechanisms to detect and remediate a bad node are only a best-effort strategy and don't cover deeper health checks across all network paths. For predictability, we propose pulling an andon cord to cut off inter-zone replication traffic, which can be achieved by decommissioning the nodes in the impacted zone.

Implications
When a zone is decommissioned, shards in it that were taking write traffic might be failed so that data consistency semantics are honoured and stale shards are marked unavailable. To make sure no in-flight requests fail, we need to weigh away shard search traffic as part of #2859. In setups without dedicated coordinator nodes, we need to ensure that no HTTP traffic is being sent and that all traffic is drained before the decommission API is triggered.

Describe the solution you'd like
A graceful mechanism to:

Decommission a zone/rack
Recommission a zone/rack

POST /_cluster/decommission 
{
 "awareness_attribute" : {"zone" : "A-0"}
}

DELETE /_cluster/decommission 
{
 "awareness_attribute" : {"zone" : "A-0"}
}
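
As a rough illustration of how a client might call the two proposed endpoints, here is a minimal Python sketch. It only builds the requests and does not send them, since the API is a proposal at this point. The `BASE_URL` value and the helper name `build_decommission_request` are assumptions made for the example and are not part of the proposal.

```python
import json
import urllib.request

# Assumed address of a local cluster; adjust for a real deployment.
BASE_URL = "http://localhost:9200"


def build_decommission_request(attribute, value, recommission=False):
    """Build (but do not send) a request for the proposed endpoint.

    POST decommissions the given awareness attribute value,
    DELETE recommissions it, mirroring the two calls in the proposal.
    """
    body = json.dumps({"awareness_attribute": {attribute: value}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/_cluster/decommission",
        data=body,
        headers={"Content-Type": "application/json"},
        method="DELETE" if recommission else "POST",
    )


decommission = build_decommission_request("zone", "A-0")
recommission = build_decommission_request("zone", "A-0", recommission=True)
print(decommission.get_method(), decommission.full_url)
print(recommission.get_method(), recommission.full_url)
```

Sending either request would be a matter of passing it to `urllib.request.urlopen`, once traffic to the zone has been drained as described under Implications.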

Constraints

The attribute value should be one of the values in union(forces_zone, discovered_zone)
There should be only one active zone under decommission or recommission at a time
The shard request weights for the decommissioned zone, from "[Feature] Support for weighted zonal search request routing policy" (#2859), should be set to zero
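
The constraints above could be checked along these lines. This is a sketch under stated assumptions, not the actual implementation: the function names, the argument names, and the shape of the weights mapping are hypothetical, and treating a repeated request for the same zone as allowed is an assumption the issue does not spell out.

```python
def validate_decommission(value, forced_zones, discovered_zones, active_zone=None):
    """Check a decommission/recommission request against the listed constraints.

    All parameter names are hypothetical. `active_zone` is the zone that is
    currently under decommission or recommission, if any.
    """
    # Constraint 1: the value must be in union(forced zones, discovered zones).
    allowed = set(forced_zones) | set(discovered_zones)
    if value not in allowed:
        raise ValueError(f"unknown zone {value!r}; expected one of {sorted(allowed)}")
    # Constraint 2: only one zone may be in flight at a time.
    # Assumption: repeating the request for the same zone is tolerated.
    if active_zone is not None and active_zone != value:
        raise ValueError(f"zone {active_zone!r} is already being processed")


def zero_weight(weights, zone):
    """Constraint 3: return a copy of the routing weights with `zone` set to zero."""
    updated = dict(weights)
    updated[zone] = 0
    return updated


validate_decommission("A-0", forced_zones=["A-0", "A-1"], discovered_zones=["A-1", "A-2"])
new_weights = zero_weight({"A-0": 1, "A-1": 1, "A-2": 1}, "A-0")
print(new_weights)
```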



adnapibar · 2022-05-23T19:01:37Z

@Bukhtawar Can you provide more details?

Bukhtawar added the enhancement and untriaged labels May 19, 2022

Bukhtawar added the discuss label and removed the untriaged label May 24, 2022

imRishN mentioned this issue Jun 21, 2022

[RFC] API for decommissioning/recommissioning zone and weighted zonal search request routing policy #3639

Closed

imRishN mentioned this issue Aug 2, 2022

[Zone Decommission] Add DecommissionService and helper to execute awareness attribute decommissioning #4083

Closed

pranikum mentioned this issue Oct 12, 2022

[WIP]: Recommission Integration Tests #4744

Closed

6 tasks

imRishN mentioned this issue Nov 4, 2022

[DOC] Awareness Attribute Decommission opensearch-project/documentation-website#1812

Closed

4 tasks
