
Control concurrency and add retry action in decommission flow #4684

Conversation

imRishN
Member

@imRishN imRishN commented Oct 5, 2022

Description

#4084 (comment)

These are the changes implemented as part of this PR -

  • Add a retry flag to the decommission request object. This flag can only be set internally in the service when the action needs to attempt a retry.
  • Add a timeout for the above retry mechanism, so that the decommission action doesn't get stuck in an endless loop of retrying.
  • Now, when the action needs to attempt a retry (e.g. after a successful abdication, when the local node is no longer the cluster manager), it triggers another decommission transport action, provided the request has not timed out.
  • In case the request gets abandoned due to a timeout, cluster manager failures, etc., the user is expected to wipe off the metadata first and then initiate a fresh request.
  • Hence, no concurrent request will be allowed unless the attribute is recommissioned again and no active, non-failed metadata is present.
  • Now, the trickier part: if the cluster manager switched internally and the cause was NOT the decommission action, the TransportClusterManagerNodeAction would internally attempt a retry of the decommission transport action. But this attempt would fail, as INIT would already be registered in the metadata and the request object would not have the retry flag set. The expectation here is that the user first clears the metadata and then initiates a fresh request (see the sketch after this list).
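A rough sketch of that concurrency check, to make the last two bullets concrete. This is not the PR's exact code; the metadata accessor, the getter for the retry flag, and the exception message below are assumptions used only for illustration:

```java
// Inside the service, before registering a new decommission request (sketch only).
DecommissionAttributeMetadata decommissionMetadata = clusterState.metadata()
    .custom(DecommissionAttributeMetadata.TYPE);
if (decommissionMetadata != null && decommissionMetadata.status() != DecommissionStatus.FAILED) {
    if (request.retryOnClusterManagerChange() == false) {
        // A user-initiated request while an active (non-failed) decommission exists:
        // reject it. The user must recommission (clear the metadata) and start afresh.
        throw new DecommissioningFailedException(
            request.getDecommissionAttribute(),
            "an active decommission request is already in progress, recommission before retrying"
        );
    }
    // Internal retry after a cluster manager switch: allowed to continue from the
    // previously registered status instead of failing on the existing INIT entry.
}
```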

Issues Resolved

#4543

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
@imRishN imRishN requested review from a team and reta as code owners October 5, 2022 12:13
@imRishN
Member Author

imRishN commented Oct 5, 2022

Adding comments from local PR here - imRishN#66


gbbafna 8 days ago
How are we making sure that the user doesn't specify this parameter? Did we evaluate putting this in the request context, as it is an internal detail of the request and not a user-facing one?

@imRishN imRishN 8 days ago
Are you talking about the retry flag or the timeout?
For the flag: the REST action doesn't accept a retry flag from the user, so any request created at the REST layer will always have it set to false. Today it is only set to true when we trigger the retry action from the service.
For the timeout: a default is set, and the user can set one according to their use case.
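For reference, a rough sketch of what that looks like on the request object; the field names and the default value are assumptions, and serialization/validation boilerplate is elided:

```java
import org.opensearch.common.unit.TimeValue;

// Sketch of DecommissionRequest (assumed shape). The REST handler never parses the
// retry flag, so requests built at the REST layer always keep the default of false;
// only the service flips it when it schedules an internal retry.
public class DecommissionRequest {
    private boolean retryOnClusterManagerChange = false;              // internal-only, not a REST parameter
    private TimeValue retryTimeout = TimeValue.timeValueSeconds(30);  // default value here is an assumption

    public void setRetryOnClusterManagerChange(boolean retry) {       // called only from the service layer
        this.retryOnClusterManagerChange = retry;
    }

    public void setRetryTimeout(TimeValue retryTimeout) {             // user-overridable timeout
        this.retryTimeout = retryTimeout;
    }

    public boolean retryOnClusterManagerChange() {
        return retryOnClusterManagerChange;
    }

    public TimeValue getRetryTimeout() {
        return retryTimeout;
    }
}
```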


Do we need the remainingTimeoutMS check here only? Why do we need it?

@imRishN imRishN 8 days ago
This timeout is a check for retry eligibility only. The other actions that are part of decommission have their own timeouts. This is the place where we actually trigger the retry, hence the check. If timed out, the request will not be eligible for a retry.

@gbbafna gbbafna 8 days ago
Can we do without this? In the worst case, the retried action will time out and we will throw a timeout exception. This part of the code looks out of place from a readability point of view.

Let me know if this answers your question.
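For context, a minimal sketch of the check under discussion; only the two setter calls come from the diff, and the start-time bookkeeping and getter names are assumptions:

```java
// Retry eligibility check (sketch). Other decommission steps carry their own
// timeouts; this one only decides whether a retry may still be dispatched.
long elapsedMS = threadPool.relativeTimeInMillis() - startTime;
long remainingTimeoutMS = decommissionRequest.getRetryTimeout().millis() - elapsedMS;
if (remainingTimeoutMS <= 0) {
    // Budget exhausted: fail the request instead of retrying; the user then clears
    // the metadata and issues a fresh decommission request.
    listener.onFailure(new OpenSearchTimeoutException(
        "timed out before retrying the decommission action on the new cluster manager"));
    return;
}
decommissionRequest.setRetryOnClusterManagerChange(true);
decommissionRequest.setRetryTimeout(TimeValue.timeValueMillis(remainingTimeoutMS));
```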


@gbbafna gbbafna 8 days ago
How are we responding to the user API call now? Earlier, the retried request on the newly elected cluster manager would respond to it. Would the same hold true now as well?

@imRishN imRishN 8 days ago
Yes, the response to the retried request is attached to the listener of the original request. The chain will continue if multiple retries get executed.
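A minimal sketch of that chaining, with method and field names assumed: the retried transport call reuses the listener of the original request, so whichever attempt completes resolves the same user-facing API call.

```java
public void retryDecommissionAction(
    final DecommissionRequest decommissionRequest,
    final long startTime,                                  // start time of the original request,
                                                           // used for the eligibility check shown earlier
    final ActionListener<DecommissionResponse> listener    // listener of the ORIGINAL request
) {
    threadPool.generic().execute(
        // Re-dispatching with the same listener keeps the response chain intact,
        // even if several retries happen across repeated cluster manager switches.
        () -> nodeClient.execute(DecommissionAction.INSTANCE, decommissionRequest, listener)
    );
}
```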


@imRishN
Member Author

imRishN commented Oct 5, 2022

@shwetathareja @Bukhtawar need your help reviewing this

@imRishN imRishN changed the title from Decommission/concurrency control to Control concurrency and add retry action in decommission flow Oct 5, 2022
Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
@github-actions
Contributor

github-actions bot commented Oct 5, 2022

Gradle Check (Jenkins) Run Completed with:

@codecov-commenter

Codecov Report

Merging #4684 (6a03e5e) into main (e189179) will increase coverage by 0.08%.
The diff coverage is 30.30%.

❗ Current head 6a03e5e differs from pull request most recent head aa498b6. Consider uploading reports for the commit aa498b6 to get more accurate results

@@             Coverage Diff              @@
##               main    #4684      +/-   ##
============================================
+ Coverage     70.52%   70.61%   +0.08%     
- Complexity    57525    57573      +48     
============================================
  Files          4654     4654              
  Lines        276978   277036      +58     
  Branches      40525    40529       +4     
============================================
+ Hits         195348   195637     +289     
+ Misses        65229    64941     -288     
- Partials      16401    16458      +57     
| Impacted Files | Coverage Δ |
|---|---|
| ...sion/awareness/put/DecommissionRequestBuilder.java | 0.00% <0.00%> (ø) |
| ...ion/awareness/put/TransportDecommissionAction.java | 50.00% <0.00%> (ø) |
| ...h/cluster/decommission/DecommissionController.java | 65.00% <0.00%> (-15.25%) ⬇️ |
| ...arch/cluster/decommission/DecommissionService.java | 22.43% <9.52%> (-0.97%) ⬇️ |
| ...ecommission/awareness/put/DecommissionRequest.java | 85.29% <81.25%> (+0.29%) ⬆️ |
| ...t/action/admin/cluster/RestDecommissionAction.java | 84.61% <100.00%> (+6.83%) ⬆️ |
| .../org/opensearch/client/indices/AnalyzeRequest.java | 31.00% <0.00%> (-45.00%) ⬇️ |
| ...h/action/ingest/SimulateDocumentVerboseResult.java | 60.71% <0.00%> (-39.29%) ⬇️ |
| ...java/org/opensearch/threadpool/ThreadPoolInfo.java | 56.25% <0.00%> (-37.50%) ⬇️ |
| ...min/cluster/snapshots/get/GetSnapshotsRequest.java | 52.63% <0.00%> (-31.58%) ⬇️ |

... and 513 more


@github-actions
Contributor

github-actions bot commented Oct 5, 2022

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
@github-actions
Contributor

github-actions bot commented Oct 8, 2022

Gradle Check (Jenkins) Run Completed with:

```java
    return;
}
decommissionRequest.setRetryOnClusterManagerChange(true);
decommissionRequest.setRetryTimeout(TimeValue.timeValueMillis(remainingTimeoutMS));
```
Collaborator

This will not be used on the new cluster manager. This retry timeout is checked only at the time of cluster manager abdication and not anywhere else. Hence, more often than not (66% in the case of a 3-cluster-manager setup), this parameter is never used within the scope of this PR. What are the cons of removing this altogether?

Member Author

For an unstable cluster with multiple leader switches, if not given a timeout, we might get stuck in an endless loop of retrying and exhaust a transport thread. This timeout helps to bound the retry action. Although I agree we might not hit this for a stable cluster, I feel it will help to reject, later on, work which an unstable cluster might not be able to execute.

```java
 * @param startTime start time of previous request
 * @param listener callback for the retry action
 */
public void retryDecommissionAction(
```
Collaborator

Should we rename it to tryDecommissionOnNewClusterManager, as it is used only there for now?

Member Author

The controller is more of a collection of generic methods that can be used in the service layer. This method is just attempting a retry, hence the name. The place where we are using it is during a leader switch (one use case of it).

Let me know if you think it makes sense. I can rename it if required.

@imRishN
Member Author

imRishN commented Nov 8, 2022

Closing this PR and opening #5146, which is dependent on #5093.

@imRishN imRishN closed this Nov 8, 2022