Fix decommission status update to non leader nodes #4800
Conversation
Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Thanks @imRishN. Fix looks good to me. Please add tests.
Codecov Report
@@ Coverage Diff @@
## main #4800 +/- ##
============================================
- Coverage 70.76% 70.69% -0.08%
+ Complexity 57926 57835 -91
============================================
Files 4689 4689
Lines 277306 277305 -1
Branches 40370 40370
============================================
- Hits 196234 196037 -197
- Misses 64822 64903 +81
- Partials 16250 16365 +115
Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
 * compatible open source license.
 */

package org.opensearch.cluster.coordination;
Added it in this package, as all the integ tests for decommissioning the zone need access to package-private methods of Coordinator.
PR for integ tests - #4715
Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
// and won't receive status update to SUCCESSFUL
String randomDecommissionedNode = randomFrom(clusterManagerNodes.get(2), dataNodes.get(2));
ClusterService decommissionedNodeClusterService = internalCluster().getInstance(ClusterService.class, randomDecommissionedNode);
assertEquals(
Is this assert done with the assumption that decommission can take some time, and that during this check the status would mostly be in progress?
No. When the decommissioned nodes are kicked out, all nodes have the status IN_PROGRESS. The status is updated to SUCCESSFUL only after the decommissioned nodes are kicked out, so the last status a decommissioned node sees is IN_PROGRESS. If not, the test must fail.
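The timeline above can be sketched with a tiny simulation (hypothetical node names; a plain map stands in for each node's locally applied cluster state): IN_PROGRESS reaches every member, the decommissioned node is then removed, and SUCCESSFUL reaches only the survivors.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LastSeenStatusDemo {
    // Returns the last status each node observed locally.
    static Map<String, String> simulate() {
        Map<String, String> lastSeen = new HashMap<>();
        List<String> members = new ArrayList<>(List.of("leader", "node-a", "node-b"));

        // Leader publishes IN_PROGRESS while all nodes are still members.
        for (String n : members) lastSeen.put(n, "IN_PROGRESS");

        // The decommissioned node is kicked out before the next publication.
        members.remove("node-b");

        // SUCCESSFUL is published only to the remaining members.
        for (String n : members) lastSeen.put(n, "SUCCESSFUL");
        return lastSeen;
    }

    public static void main(String[] args) {
        Map<String, String> lastSeen = simulate();
        System.out.println(lastSeen.get("node-b")); // IN_PROGRESS
        System.out.println(lastSeen.get("node-a")); // SUCCESSFUL
    }
}
```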
@Bukhtawar Can you please review this?
// if the current status is the expected status already or new status is FAILED, we let the check pass
if (newStatus.equals(status) || newStatus.equals(DecommissionStatus.FAILED)) {
    return;
This is assuming that all steps can have a self-loop for state transitions
Didn't get this. This check is added to ensure that if any step wants to mark the status as FAILED during decommission, we allow it to do so. Let me know if you have a specific case in mind.
And this behaviour is the same as before. I had to refactor this method a bit because we were updating the same instance, which led to a relative diff of 0 for the decommission state update.
The self-loops that might come up during multiple concurrent requests are handled separately as part of PR #4684. Today we need the status to get into the same state as per the current service implementation.
The other condition makes it look paranoid
Let's revisit this as part of #4684.
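The acceptance rule being discussed can be sketched as a small validator (a minimal sketch with an assumed linear ordering, not the actual OpenSearch code): a self-loop or a move to FAILED is always accepted, and otherwise only the next forward step is allowed.

```java
// Hypothetical status order; the real service may define more states.
enum DecommissionStatus { INIT, IN_PROGRESS, SUCCESSFUL, FAILED }

public class TransitionCheck {
    static boolean isAllowed(DecommissionStatus current, DecommissionStatus next) {
        // Self-loop (same status again) or a move to FAILED is always accepted.
        if (next == current || next == DecommissionStatus.FAILED) {
            return true;
        }
        // Otherwise only the next step in the assumed linear order is allowed.
        return next.ordinal() == current.ordinal() + 1;
    }

    public static void main(String[] args) {
        System.out.println(isAllowed(DecommissionStatus.IN_PROGRESS, DecommissionStatus.IN_PROGRESS)); // true
        System.out.println(isAllowed(DecommissionStatus.INIT, DecommissionStatus.FAILED));             // true
        System.out.println(isAllowed(DecommissionStatus.SUCCESSFUL, DecommissionStatus.INIT));         // false
    }
}
```

Allowing the self-loop is what lets concurrent requests land on the same state without failing, which is the behaviour the comments above say is needed until #4684 revisits it.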
Fix decommission status update to non leader nodes (#4800)

* Add DecommissionService and helper to execute awareness attribute decommissioning #4084
* Add APIs (GET/PUT) to decommission awareness attribute #4261
* Controlling discovery for decommissioned nodes #4590
* Fix decommission status update to non leader nodes #4800
* Remove redundant field from GetDecommissionStateResponse #4751
* Service Layer changes for Recommission API #4320
* Recommission api level support #4604
* Fix bug in AwarenessAttributeDecommissionIT #4822

Awareness attribute decommission backports (#4970)

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Description
This PR resolves a decommission status update metadata bug in which non-leader nodes were not getting the status update locally: the same object was updated during the submit state update, causing the diff to be 0. A detailed explanation can be found in the issue.
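The bug can be illustrated with a stripped-down model (hypothetical names; a plain field comparison stands in for the real cluster state diffing): if the metadata object held by the current state is mutated in place, the "previous" and "new" snapshots compare equal, so no diff is published and followers never see the change. Building a fresh instance makes the diff non-zero.

```java
import java.util.Objects;

// Mutable on purpose, to demonstrate the bug.
class Metadata {
    String status;
    Metadata(String status) { this.status = status; }
}

public class DiffBugDemo {
    // Stand-in for cluster state diffing: publish only if the snapshots differ.
    static boolean hasDiff(Metadata previous, Metadata next) {
        return !Objects.equals(previous.status, next.status);
    }

    static boolean buggyUpdate() {
        Metadata shared = new Metadata("IN_PROGRESS");
        Metadata previous = shared;       // the old state holds the same object
        shared.status = "SUCCESSFUL";     // in-place mutation also changes "previous"
        return hasDiff(previous, shared); // false: followers never see the update
    }

    static boolean fixedUpdate() {
        Metadata previous = new Metadata("IN_PROGRESS");
        Metadata next = new Metadata("SUCCESSFUL"); // fresh instance for the new state
        return hasDiff(previous, next);             // true: the diff gets published
    }

    public static void main(String[] args) {
        System.out.println(buggyUpdate() + " " + fixedUpdate()); // false true
    }
}
```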
Issues Resolved
#4799
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.