
[Zone Decommission] Gracefully handle concurrent zone decommission action #4543

Closed
imRishN opened this issue Sep 19, 2022 · 6 comments
Labels
distributed framework enhancement Enhancement or improvement to existing feature or request

Comments

@imRishN
Member

imRishN commented Sep 19, 2022

#4084 (comment)

@imRishN
Member Author

imRishN commented Sep 26, 2022

As per the comment on PR #4084 -

As @Bukhtawar suggested, we can add a flag to the request object indicating whether it is a retry or a new request, so that an INIT -> INIT transition can be allowed. All other concurrent requests are rejected. If decommissioning gets stuck at INIT or IN_PROGRESS, the user would first clear the metadata using the delete API and then trigger a new decommissioning request.

I'd like to propose a few changes to DecommissionService -

  1. Add a retry flag to the decommission request object. This flag can only be set internally by the service when the action needs to attempt a retry.
  2. Add a timeout for the above retry mechanism, so that the decommission action doesn't get stuck in an endless loop of retrying.
  3. When the action decides to attempt a retry (e.g. after a successful abdication, when the local node is no longer the cluster manager), it will trigger another decommission transport action if it has not timed out.
  4. If the request gets abandoned due to a timeout, cluster manager failures, etc., the user is expected to wipe off the metadata first and then initiate a fresh request.
  5. Hence no concurrent request will be allowed unless the attribute is recommissioned again and no active, non-failed metadata is present.
  6. The trickier part here is: if the cluster manager switched internally and the cause was NOT the decommission action, TransportClusterManagerNodeAction would internally attempt a retry of the decommission transport action. But that attempt would fail, because INIT would already be registered in the metadata and the request object was not updated with the retry flag. The expectation here is that the user first clears the metadata and then initiates a fresh request.
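The retry flag, timeout, and concurrency rejection described in points 1-5 can be sketched as a small state model. This is a hypothetical Python sketch for illustration only, not the actual OpenSearch `DecommissionService` (the class names, statuses, and return strings are all assumptions):

```python
import time

class DecommissionRequest:
    """Hypothetical model of the proposed request object."""
    def __init__(self, attribute, retry=False, retry_deadline=None):
        self.attribute = attribute
        self.retry = retry                    # set only internally when the service re-dispatches
        self.retry_deadline = retry_deadline  # absolute monotonic time after which retries are dropped

class DecommissionService:
    """Toy model of the proposed concurrency handling."""
    def __init__(self):
        self.metadata = None  # (attribute, status) tuple, or None

    def decommission(self, request):
        if self.metadata is None:
            # Fresh request: register INIT metadata.
            self.metadata = (request.attribute, "INIT")
            return "accepted"
        attr, status = self.metadata
        if request.retry and attr == request.attribute:
            # Internal retry: allow INIT -> INIT, unless the retry window has expired.
            if request.retry_deadline is not None and time.monotonic() > request.retry_deadline:
                return "rejected: retry timed out"
            return "accepted"
        # Any other concurrent request is rejected outright.
        return "rejected: concurrent decommission in progress"

    def clear_metadata(self):
        """Delete API: the user wipes stale metadata before a fresh request."""
        self.metadata = None
```

For example, a second request for a different attribute is rejected while metadata exists, an internal retry for the same attribute is allowed, and after `clear_metadata()` a fresh request succeeds.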

@imRishN
Member Author

imRishN commented Sep 26, 2022

Create a PR for this change here - https://github.com/imRishN/OpenSearch/pull/66/files

@imRishN
Member Author

imRishN commented Sep 26, 2022

@shwetathareja @Bukhtawar do let me know your thoughts on this

@shwetathareja
Member

shwetathareja commented Oct 17, 2022

@imRishN: #4084 (comment) I want to brainstorm on option 1 from the earlier discussion.

A new proposal - setting the decommission metadata to the INIT state and adding the voting exclusion can both be done in a single cluster state update, as one atomic operation, instead of making a separate transport call for the exclusion. The decommission request would only make this change synchronously; the rest of the operations, including clearing of the voting config, can happen in the background. This would simplify the code significantly, with no need to worry about concurrent executions.
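The atomicity argument above can be illustrated with a toy model: cluster state update tasks are applied one at a time, so folding both changes into a single task means no interleaving is possible. This is a hypothetical Python sketch under that assumption, not the real OpenSearch cluster state API (`ClusterStateUpdater`, `init_decommission_with_exclusion`, and the state dictionary layout are all invented for illustration):

```python
import copy

class ClusterStateUpdater:
    """Toy single-threaded cluster state queue: each submitted task is
    applied atomically, mirroring how a cluster manager serializes updates."""
    def __init__(self):
        self.state = {"decommission": None, "voting_exclusions": set()}

    def submit(self, task):
        # Apply the task to a copy; commit only if it succeeds, so a
        # failed task leaves the previous state untouched.
        new_state = task(copy.deepcopy(self.state))
        self.state = new_state
        return self.state

def init_decommission_with_exclusion(attribute, excluded_nodes):
    """One task that both registers the INIT decommission metadata and adds
    the voting exclusions, instead of a separate transport call for the exclusion."""
    def task(state):
        if state["decommission"] is not None:
            raise RuntimeError("decommission already in progress")
        state["decommission"] = {"attribute": attribute, "status": "INIT"}
        state["voting_exclusions"] |= set(excluded_nodes)
        return state
    return task
```

Because the INIT check and both writes happen inside one task, a second concurrent decommission attempt fails cleanly inside its own task rather than racing a separate exclusion call.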

@imRishN
Member Author

imRishN commented Nov 8, 2022

@shwetathareja Created a PR for option 1 - #5093
Also, I want to call out that this PR would still not be able to restrict concurrency. I'll create one more PR on top of it to gracefully handle the concurrency.
I'll go ahead and close #4684 and will open a new one for controlling concurrency once we settle on the approach in #5093.

@imRishN
Member Author

imRishN commented Jan 10, 2023

Resolving this issue since the PR has been merged.

@imRishN imRishN closed this as completed Jan 10, 2023