
Control concurrency and add retry action in decommission flow #4684

Conversation

imRishN
Member

@imRishN imRishN commented Oct 5, 2022

Description

#4084 (comment)

These are the changes implemented as part of this PR -

  • Add a retry flag to the decommission request object. This flag can only be set internally in the service when the action needs to attempt a retry.
  • Add a timeout for the above retry mechanism, so that the decommission action doesn't get stuck in an endless loop of retrying.
  • Now, when the action needs to attempt a retry (e.g. after a successful abdication, when the local node is no longer the cluster manager), it triggers another decommission transport action, provided the request has not timed out.
  • In case the request gets abandoned due to a timeout, cluster manager failures, etc., the user is expected to wipe off the metadata first and then initiate a fresh request.
  • Hence, no concurrent request will be allowed unless the attribute is recommissioned again and no active, non-failed metadata is present.
  • Now, the trickier part: if the cluster manager switched internally and the cause was NOT the decommission action, the TransportClusterManagerNodeAction would internally attempt a retry of the decommission transport action. But this attempt would fail, as INIT would already be registered in the metadata and the request object would not have the retry flag set. The expectation here is that the user first clears the metadata and then initiates a fresh request (see the sketch after this list).
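A rough sketch of that concurrency check, to make the last two bullets concrete. This is not the PR's exact code; the metadata accessor, the getter for the retry flag, and the exception message below are assumptions used only for illustration:

```java
// Inside the service, before registering a new decommission request (sketch only).
DecommissionAttributeMetadata decommissionMetadata = clusterState.metadata()
    .custom(DecommissionAttributeMetadata.TYPE);
if (decommissionMetadata != null && decommissionMetadata.status() != DecommissionStatus.FAILED) {
    if (request.retryOnClusterManagerChange() == false) {
        // A user-initiated request while an active (non-failed) decommission exists:
        // reject it. The user must recommission (clear the metadata) and start afresh.
        throw new DecommissioningFailedException(
            request.getDecommissionAttribute(),
            "an active decommission request is already in progress, recommission before retrying"
        );
    }
    // Internal retry after a cluster manager switch: allowed to continue from the
    // previously registered status instead of failing on the existing INIT entry.
}
```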

Issues Resolved

#4543

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
@imRishN imRishN requested review from a team and reta as code owners October 5, 2022 12:13
@imRishN
Member Author

imRishN commented Oct 5, 2022

Adding comments from local PR here - imRishN#66


gbbafna 8 days ago
How are we making sure that the user doesn't specify this parameter? Did we evaluate putting this in the request context, as it is an internal detail of the request and not a user-facing one?

@imRishN imRishN 8 days ago
Are you talking about the retry flag or the timeout?
For the flag: the REST action doesn't accept a retry flag from the user, so any request created at the REST layer will always have it set to false. Today it is only set to true when we trigger the retry action from the service.
For the timeout: a default is set, and the user can set one according to their use case.
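For reference, a rough sketch of what that looks like on the request object; the field names and the default value are assumptions, and serialization/validation boilerplate is elided:

```java
import org.opensearch.common.unit.TimeValue;

// Sketch of DecommissionRequest (assumed shape). The REST handler never parses the
// retry flag, so requests built at the REST layer always keep the default of false;
// only the service flips it when it schedules an internal retry.
public class DecommissionRequest {
    private boolean retryOnClusterManagerChange = false;              // internal-only, not a REST parameter
    private TimeValue retryTimeout = TimeValue.timeValueSeconds(30);  // default value here is an assumption

    public void setRetryOnClusterManagerChange(boolean retry) {       // called only from the service layer
        this.retryOnClusterManagerChange = retry;
    }

    public void setRetryTimeout(TimeValue retryTimeout) {             // user-overridable timeout
        this.retryTimeout = retryTimeout;
    }

    public boolean retryOnClusterManagerChange() {
        return retryOnClusterManagerChange;
    }

    public TimeValue getRetryTimeout() {
        return retryTimeout;
    }
}
```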


Do we need the remainingTimeoutMS check here only? Why do we need it?

@imRishN imRishN 8 days ago
This timeout is a check for retry eligibility only. The other actions that are part of decommission have their own timeouts. This is the place where we actually trigger the retry, hence the check. If timed out, the request will not be eligible for a retry.

@gbbafna gbbafna 8 days ago
Can we do without this? In the worst case, the retried action will time out and we will throw a timeout exception. This part of the code looks out of place from a readability point of view.

Let me know if this answers your question.
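For context, a minimal sketch of the check under discussion; only the two setter calls come from the diff, and the start-time bookkeeping and getter names are assumptions:

```java
// Retry eligibility check (sketch). Other decommission steps carry their own
// timeouts; this one only decides whether a retry may still be dispatched.
long elapsedMS = threadPool.relativeTimeInMillis() - startTime;
long remainingTimeoutMS = decommissionRequest.getRetryTimeout().millis() - elapsedMS;
if (remainingTimeoutMS <= 0) {
    // Budget exhausted: fail the request instead of retrying; the user then clears
    // the metadata and issues a fresh decommission request.
    listener.onFailure(new OpenSearchTimeoutException(
        "timed out before retrying the decommission action on the new cluster manager"));
    return;
}
decommissionRequest.setRetryOnClusterManagerChange(true);
decommissionRequest.setRetryTimeout(TimeValue.timeValueMillis(remainingTimeoutMS));
```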


@gbbafna gbbafna 8 days ago
How are we responding to the user API call now? Earlier, the retried request on the newly elected cluster manager would respond to it. Would the same hold true now as well?

@imRishN imRishN 8 days ago
Yes, the response to the retried request is attached to the listener of the original request. The chain will continue if multiple retries get executed.
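A minimal sketch of that chaining, with method and field names assumed: the retried transport call reuses the listener of the original request, so whichever attempt completes resolves the same user-facing API call.

```java
public void retryDecommissionAction(
    final DecommissionRequest decommissionRequest,
    final long startTime,                                  // start time of the original request,
                                                           // used for the eligibility check shown earlier
    final ActionListener<DecommissionResponse> listener    // listener of the ORIGINAL request
) {
    threadPool.generic().execute(
        // Re-dispatching with the same listener keeps the response chain intact,
        // even if several retries happen across repeated cluster manager switches.
        () -> nodeClient.execute(DecommissionAction.INSTANCE, decommissionRequest, listener)
    );
}
```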


@imRishN
Member Author

imRishN commented Oct 5, 2022

@shwetathareja @Bukhtawar need your help reviewing this

@imRishN imRishN changed the title from Decommission/concurrency control to Control concurrency and add retry action in decommission flow Oct 5, 2022
Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
@github-actions
Contributor

github-actions bot commented Oct 5, 2022

Gradle Check (Jenkins) Run Completed with:

@codecov-commenter

Codecov Report

Merging #4684 (6a03e5e) into main (e189179) will increase coverage by 0.08%.
The diff coverage is 30.30%.

❗ Current head 6a03e5e differs from pull request most recent head aa498b6. Consider uploading reports for the commit aa498b6 to get more accurate results

@@             Coverage Diff              @@
##               main    #4684      +/-   ##
============================================
+ Coverage     70.52%   70.61%   +0.08%     
- Complexity    57525    57573      +48     
============================================
  Files          4654     4654              
  Lines        276978   277036      +58     
  Branches      40525    40529       +4     
============================================
+ Hits         195348   195637     +289     
+ Misses        65229    64941     -288     
- Partials      16401    16458      +57     
| Impacted Files | Coverage Δ |
|---|---|
| ...sion/awareness/put/DecommissionRequestBuilder.java | 0.00% <0.00%> (ø) |
| ...ion/awareness/put/TransportDecommissionAction.java | 50.00% <0.00%> (ø) |
| ...h/cluster/decommission/DecommissionController.java | 65.00% <0.00%> (-15.25%) ⬇️ |
| ...arch/cluster/decommission/DecommissionService.java | 22.43% <9.52%> (-0.97%) ⬇️ |
| ...ecommission/awareness/put/DecommissionRequest.java | 85.29% <81.25%> (+0.29%) ⬆️ |
| ...t/action/admin/cluster/RestDecommissionAction.java | 84.61% <100.00%> (+6.83%) ⬆️ |
| .../org/opensearch/client/indices/AnalyzeRequest.java | 31.00% <0.00%> (-45.00%) ⬇️ |
| ...h/action/ingest/SimulateDocumentVerboseResult.java | 60.71% <0.00%> (-39.29%) ⬇️ |
| ...java/org/opensearch/threadpool/ThreadPoolInfo.java | 56.25% <0.00%> (-37.50%) ⬇️ |
| ...min/cluster/snapshots/get/GetSnapshotsRequest.java | 52.63% <0.00%> (-31.58%) ⬇️ |

... and 513 more


@github-actions
Contributor

github-actions bot commented Oct 5, 2022

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
@github-actions
Contributor

github-actions bot commented Oct 8, 2022

Gradle Check (Jenkins) Run Completed with:

```java
    return;
}
decommissionRequest.setRetryOnClusterManagerChange(true);
decommissionRequest.setRetryTimeout(TimeValue.timeValueMillis(remainingTimeoutMS));
```
Collaborator

This will not be used on the new cluster manager. This retry timeout is checked only at the time of cluster manager abdication and not anywhere else. Hence, more often than not (66% in the case of a 3-cluster-manager setup), this parameter is never used within the scope of this PR. What are the cons of removing this altogether?

Member Author

For an unstable cluster with multiple leader switches, if not given a timeout, we might get stuck in an endless loop of retrying and exhaust a transport thread. This timeout helps to bound the retry action. Although I agree we might not hit this for a stable cluster, I feel it will help to reject, later on, work which an unstable cluster might not be able to execute.

```java
 * @param startTime start time of previous request
 * @param listener callback for the retry action
 */
public void retryDecommissionAction(
```
Collaborator

Should we rename it to tryDecommissionOnNewClusterManager, as it is used only there for now?

Member Author

The controller is more of a collection of generic methods that can be used in the service layer. This method is just attempting a retry, hence the name. The place where we are using it is during a leader switch (one use case of it).

Let me know if you think it makes sense. I can rename it if required.

@imRishN
Member Author

imRishN commented Nov 8, 2022

Closing this PR and opening #5146, which is dependent on #5093.

@imRishN imRishN closed this Nov 8, 2022