Support for decommissioning and recommissioning a zone #3402
Labels
discuss
Issues intended to help drive brainstorming and decision making
enhancement
Enhancement or improvement to existing feature or request
Is your feature request related to a problem? Please describe.
There are use cases where decommissioning a zone/rack might be beneficial.
Background
With #2859 we intend to weigh away shard search traffic. However, since OpenSearch follows a synchronous replication model, replication requests can get stuck due to any impairment in the write path. The current health check mechanisms to detect and remediate a bad node are only a best-effort strategy and don't cover deeper health checks across all network paths. For predictability, we propose pulling an andon cord to cut off inter-zone replication traffic, which can be achieved by decommissioning the nodes in the impacted zone.
Implications
As a result of decommissioning a zone, all shards in it that were taking write traffic might be failed, so that data consistency semantics are honoured and stale shards are marked unavailable. To make sure no in-flight requests fail, we need to weigh away shard search traffic as a part of #2859. In setups without dedicated coordinators, we need to ensure that no HTTP traffic is being sent and that all traffic is drained before a decommission API is triggered.
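To make the ordering concrete, here is a minimal in-memory sketch of the sequence described above: weigh away search traffic, confirm the zone is drained, and only then decommission. Everything in it is hypothetical and for illustration only: `Zone`, `search_weight`, `in_flight`, and `decommission_zone` are made-up names, not an existing or proposed OpenSearch API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Zone:
    """Hypothetical stand-in for a zone; not an OpenSearch type."""
    name: str
    search_weight: float = 1.0    # assumed routing weight for shard search traffic
    in_flight: int = 0            # assumed count of requests still being served
    decommissioned: bool = False


def decommission_zone(zone: Zone) -> List[str]:
    """Apply the steps in order and return the names of the steps taken.

    Raises RuntimeError if the zone still has in-flight requests, so that
    decommissioning never proceeds while traffic is undrained.
    """
    steps: List[str] = []

    # Step 1: weigh away shard search traffic from the zone.
    zone.search_weight = 0.0
    steps.append("weigh_away")

    # Step 2: refuse to continue until all traffic has drained.
    if zone.in_flight > 0:
        raise RuntimeError(
            f"zone {zone.name} still has {zone.in_flight} in-flight requests"
        )
    steps.append("drained")

    # Step 3: only now mark the zone as decommissioned.
    zone.decommissioned = True
    steps.append("decommissioned")
    return steps
```

In this sketch a zone with undrained traffic ends up weighed away but not decommissioned, which is the state an operator would presumably retry from once the remaining requests complete.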
Describe the solution you'd like
A graceful mechanism to
Constraints
Describe alternatives you've considered
A clear and concise description of any alternative solutions or features you've considered.
Additional context
Add any other context or screenshots about the feature request here.