Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix decommission status update to non leader nodes #4800

Merged
merged 9 commits into from
Oct 18, 2022

Conversation

imRishN
Copy link
Member

@imRishN imRishN commented Oct 16, 2022

Description

This PR resolves decommission status update metadata bug where in the non leader nodes were not getting status update locally as the same object was updated during submit state update causing the diff to be 0. Detailed explanation can be found in the issue.

Issues Resolved

#4799

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
@imRishN imRishN requested review from a team and reta as code owners October 16, 2022 06:29
@imRishN imRishN changed the title Decommission/metadata bug Fix decommission status update to non leader nodes Oct 16, 2022
Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
@anshu1106
Copy link
Contributor

Thanks @imRishN. Fix looks good to me. Please add tests.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@codecov-commenter
Copy link

codecov-commenter commented Oct 16, 2022

Codecov Report

Merging #4800 (dd8b4ec) into main (1d65485) will decrease coverage by 0.07%.
The diff coverage is 57.14%.

@@             Coverage Diff              @@
##               main    #4800      +/-   ##
============================================
- Coverage     70.76%   70.69%   -0.08%     
+ Complexity    57926    57835      -91     
============================================
  Files          4689     4689              
  Lines        277306   277305       -1     
  Branches      40370    40370              
============================================
- Hits         196234   196037     -197     
- Misses        64822    64903      +81     
- Partials      16250    16365     +115     
Impacted Files Coverage Δ
...er/decommission/DecommissionAttributeMetadata.java 65.78% <25.00%> (+1.23%) ⬆️
...h/cluster/decommission/DecommissionController.java 80.72% <100.00%> (+0.47%) ⬆️
...n/indices/forcemerge/ForceMergeRequestBuilder.java 0.00% <0.00%> (-75.00%) ⬇️
...a/org/opensearch/client/cluster/ProxyModeInfo.java 0.00% <0.00%> (-55.00%) ⬇️
.../java/org/opensearch/node/NodeClosedException.java 50.00% <0.00%> (-50.00%) ⬇️
...ava/org/opensearch/action/NoSuchNodeException.java 0.00% <0.00%> (-50.00%) ⬇️
.../java/org/opensearch/search/dfs/AggregatedDfs.java 51.61% <0.00%> (-45.17%) ⬇️
.../opensearch/client/indices/CloseIndexResponse.java 48.75% <0.00%> (-42.50%) ⬇️
.../admin/cluster/reroute/ClusterRerouteResponse.java 60.00% <0.00%> (-40.00%) ⬇️
...h/action/ingest/SimulateDocumentVerboseResult.java 60.71% <0.00%> (-39.29%) ⬇️
... and 489 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
* compatible open source license.
*/

package org.opensearch.cluster.coordination;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added it in this package as all the integ test for decommissioning the zone need access to pkg private methods of Coordinator.
PR for integ tests - #4715

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

// and won't receive status update to SUCCESSFUL
String randomDecommissionedNode = randomFrom(clusterManagerNodes.get(2), dataNodes.get(2));
ClusterService decommissionedNodeClusterService = internalCluster().getInstance(ClusterService.class, randomDecommissionedNode);
assertEquals(
Copy link
Contributor

@anshu1106 anshu1106 Oct 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This assert is done with assumption that decommission can take some time and during this check status would mostly be in progress?

Copy link
Member Author

@imRishN imRishN Oct 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, when the decommissioned nodes were kicked out all the nodes would be having status as IN_PROGRESS. Status is updated to SUCCESSFUL after the decommissioned nodes are kicked out. And hence the last status that decommissioned nodes would be seeing is IN_PROGRESS. If not, the test must fail

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@imRishN
Copy link
Member Author

imRishN commented Oct 16, 2022

@Bukhtawar Can you please review this?

Comment on lines +85 to +87
// if the current status is the expected status already or new status is FAILED, we let the check pass
if (newStatus.equals(status) || newStatus.equals(DecommissionStatus.FAILED)) {
return;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is assuming that all steps can have a self-loop for state transitions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't get this. This is added to ensure that if any step wants to mark the status as FAILED during decommission, we will allow it to do so. Let me know if you have any specific case in mind

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this behaviour is same as before. Had to refactor this method a bit because we were updating the same instance which led to relative diff as 0 for decommission state update

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The self loops which we might come up during multiple concurrent requests is handled seperately as part of this PR #4684. Today we need status to get into same state as per current service implementation

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other condition makes it look paranoid

Copy link
Collaborator

@Bukhtawar Bukhtawar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets revisit this as a part of #4684

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@Bukhtawar Bukhtawar merged commit cdc7a2f into opensearch-project:main Oct 18, 2022
imRishN added a commit to imRishN/OpenSearch that referenced this pull request Oct 28, 2022
…t#4800)

* Fix decommission status update to non leader nodes

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Bukhtawar pushed a commit that referenced this pull request Nov 2, 2022
* Add DecommissionService and helper to execute awareness attribute decommissioning #4084
* Add APIs (GET/PUT) to decommission awareness attribute #4261
* Controlling discovery for decommissioned nodes #4590
* Fix decommission status update to non leader nodes #4800
* Remove redundant field from GetDecommissionStateResponse #4751
* Service Layer changes for Recommission API #4320
* Recommission api level support #4604
* Fix bug in AwarenessAttributeDecommissionIT #4822

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
imRishN added a commit to imRishN/OpenSearch that referenced this pull request Nov 3, 2022
* Add DecommissionService and helper to execute awareness attribute decommissioning opensearch-project#4084
* Add APIs (GET/PUT) to decommission awareness attribute opensearch-project#4261
* Controlling discovery for decommissioned nodes opensearch-project#4590
* Fix decommission status update to non leader nodes opensearch-project#4800
* Remove redundant field from GetDecommissionStateResponse opensearch-project#4751
* Service Layer changes for Recommission API opensearch-project#4320
* Recommission api level support opensearch-project#4604
* Fix bug in AwarenessAttributeDecommissionIT opensearch-project#4822

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
imRishN added a commit to imRishN/OpenSearch that referenced this pull request Nov 3, 2022
* Add DecommissionService and helper to execute awareness attribute decommissioning opensearch-project#4084
* Add APIs (GET/PUT) to decommission awareness attribute opensearch-project#4261
* Controlling discovery for decommissioned nodes opensearch-project#4590
* Fix decommission status update to non leader nodes opensearch-project#4800
* Remove redundant field from GetDecommissionStateResponse opensearch-project#4751
* Service Layer changes for Recommission API opensearch-project#4320
* Recommission api level support opensearch-project#4604
* Fix bug in AwarenessAttributeDecommissionIT opensearch-project#4822

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Bukhtawar pushed a commit that referenced this pull request Nov 3, 2022
* Awareness attribute decommission backports (#4970)
* Add DecommissionService and helper to execute awareness attribute decommissioning #4084
* Add APIs (GET/PUT) to decommission awareness attribute #4261
* Controlling discovery for decommissioned nodes #4590
* Fix decommission status update to non leader nodes #4800
* Remove redundant field from GetDecommissionStateResponse #4751
* Service Layer changes for Recommission API #4320
* Recommission api level support #4604
* Fix bug in AwarenessAttributeDecommissionIT #4822

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
ashking94 pushed a commit to ashking94/OpenSearch that referenced this pull request Nov 7, 2022
…t#4800)

* Fix decommission status update to non leader nodes 

Signed-off-by: Rishab Nahata <rnnahata@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants