Allocation of new shards on remote store nodes #145

Open
wants to merge 4 commits into
base: mixed-mode
from

Conversation

gbbafna
Owner

@gbbafna gbbafna commented Feb 13, 2024

Description

[Describe what this change achieves]

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

gbbafna and others added 3 commits February 8, 2024 09:27
Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
Signed-off-by: Gaurav Bafna <gbbafna@amazon.com>
Collaborator

@shourya035 shourya035 left a comment

LGTM overall. Please add the extra assertions to the new ITs.

if (direction.equals(Direction.REMOTE_STORE)) {
    if (!primaryShardNode.isRemoteStoreNode() && targetNode.isRemoteStoreNode()) {
        return allocation.decision(Decision.NO, NAME,
            "can not allocate replica shard on a remote node when primary shard is not already active on some remote node");
Collaborator

nit: Can we change the message to something like this?

cannot allocate replica shard copy on a remote node since primary shard copy is not yet migrated to remote
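
For reference, the rule under discussion can be sketched in plain Java. This is only an illustrative model, not the PR's code: `MigrationDeciderSketch`, `Node`, `Decision`, and `REPLICA_BEFORE_PRIMARY` are hypothetical stand-ins for the OpenSearch types, and returning YES for every other case is an assumption made to keep the sketch small. Keeping the message in a constant would also let a test assert on the exact string.

```java
// Hypothetical stand-ins; none of these are the real OpenSearch classes.
final class MigrationDeciderSketch {
    enum Direction { REMOTE_STORE, DOCREP }
    enum Type { YES, NO }

    static final class Node {
        final String id;
        final boolean remoteStoreNode;
        Node(String id, boolean remoteStoreNode) {
            this.id = id;
            this.remoteStoreNode = remoteStoreNode;
        }
    }

    static final class Decision {
        final Type type;
        final String explanation;
        Decision(Type type, String explanation) {
            this.type = type;
            this.explanation = explanation;
        }
    }

    // Wording proposed in the review comment above.
    static final String REPLICA_BEFORE_PRIMARY =
        "cannot allocate replica shard copy on a remote node since primary shard copy is not yet migrated to remote";

    // Blocks a replica from landing on a remote node while its primary is still on a non-remote node.
    static Decision canAllocateReplica(Direction direction, Node primaryNode, Node targetNode) {
        if (direction == Direction.REMOTE_STORE
            && !primaryNode.remoteStoreNode
            && targetNode.remoteStoreNode) {
            return new Decision(Type.NO, REPLICA_BEFORE_PRIMARY);
        }
        // Assumption for this sketch only: every other combination is allowed.
        return new Decision(Type.YES, "no migration rule blocks this allocation");
    }

    public static void main(String[] args) {
        Node docrepNode = new Node("docrep-1", false);
        Node remoteNode = new Node("remote-1", true);
        Decision d = canAllocateReplica(Direction.REMOTE_STORE, docrepNode, remoteNode);
        System.out.println(d.type + ": " + d.explanation);
    }
}
```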

logger.info(" --> verify non-allocation of primary shard");
RoutingTable routingTable = client.admin().cluster().prepareState().execute().actionGet().getState().getRoutingTable();
ShardRouting primaryShardRouting = routingTable.index(TEST_INDEX).shard(0).primaryShard();
assertFalse(primaryShardRouting.active());
Collaborator

Can we also add some extra assertions to this test:

  • Cluster health should be RED (since the primary shard is not allocated yet)
  • Run the _cluster/allocation/explain API and assert on the reason the shard was not allocated. The reason should match the string the decider publishes on a NO decision

logger.info(" --> verify non-allocation of replica shard");
routingTable = client.admin().cluster().prepareState().execute().actionGet().getState().getRoutingTable();
replicaShardRouting = routingTable.index(TEST_INDEX).shard(0).replicaShards().get(0);
assertFalse(replicaShardRouting.active());
Collaborator

Same as above. The assertion should be on the reason the shard is UNASSIGNED, and the cluster health should be YELLOW.
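
The RED and YELLOW expectations in these comments follow from how cluster health is derived from shard state: an inactive primary makes the cluster RED, while an inactive replica with an active primary makes it YELLOW. A minimal plain-Java model of that rule, where `HealthExpectationSketch`, `ShardCopy`, and `healthOf` are hypothetical names rather than OpenSearch APIs:

```java
import java.util.List;

// Hypothetical model of the health rule; not the OpenSearch implementation.
final class HealthExpectationSketch {
    enum Health { GREEN, YELLOW, RED }

    static final class ShardCopy {
        final boolean primary;
        final boolean active;
        ShardCopy(boolean primary, boolean active) {
            this.primary = primary;
            this.active = active;
        }
    }

    // RED if any primary is inactive, YELLOW if only replicas are inactive, GREEN otherwise.
    static Health healthOf(List<ShardCopy> copies) {
        boolean allActive = true;
        for (ShardCopy copy : copies) {
            if (!copy.active) {
                if (copy.primary) {
                    return Health.RED;
                }
                allActive = false;
            }
        }
        return allActive ? Health.GREEN : Health.YELLOW;
    }

    public static void main(String[] args) {
        // Primary blocked by the decider: the test would expect RED.
        System.out.println(healthOf(List.of(new ShardCopy(true, false))));
        // Primary active, replica blocked: the test would expect YELLOW.
        System.out.println(healthOf(List.of(new ShardCopy(true, true), new ShardCopy(false, false))));
    }
}
```

In the actual ITs the equivalent checks would go through the cluster health and allocation explain APIs rather than a local model like this one.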

logger.info(" --> verify allocation of replica shard");
routingTable = client.admin().cluster().prepareState().execute().actionGet().getState().getRoutingTable();
replicaShardRouting = routingTable.index(TEST_INDEX).shard(0).replicaShards().get(0);
assertTrue(replicaShardRouting.active());
Collaborator

Please add the same assertions here as well, as mentioned above.

clusterSettings.addSettingsUpdateConsumer(RemoteStoreNodeService.DIRECTION_SETTING, this::setDirection);
}

private void setDirection(Direction direction) {

Can we rename Direction to something more explicit, maybe MigrationRoute?

Owner Author

Ack.

*
* @opensearch.internal
*/
public class RemoteStoreAllocationDecider extends AllocationDecider {

Do we need version checks as well, to ensure remote store nodes are not on a higher version than the rest of the existing nodes?

Owner Author

+1. Let's name it RemoteStoreMigrationAllocationDecider.

@gbbafna gbbafna force-pushed the mixed-mode branch 2 times, most recently from 55a8ef9 to 7161727 Compare February 15, 2024 12:28
*
* @opensearch.internal
*/
public class RemoteStoreAllocationDecider extends AllocationDecider {
Owner Author

+1. Let's name it RemoteStoreMigrationAllocationDecider.

.filter(nd -> nd.nodeId().equals(primaryShardRouting.currentNodeId()))
.findFirst().get().node();

if (direction.equals(Direction.REMOTE_STORE)) {
Owner Author

Let's handle the other direction, DOCREP, as well.
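
One possible shape for handling both directions is a switch on the direction, sketched below in plain Java. The DOCREP branch is purely an assumption: it mirrors the REMOTE_STORE rule, and the PR does not say what the DOCREP behaviour should be. `BothDirectionsSketch` and `replicaBlockReason` are hypothetical names, and the DOCREP message text is made up for illustration.

```java
import java.util.Optional;

// Hypothetical sketch; the DOCREP rule is an assumed mirror of the REMOTE_STORE rule.
final class BothDirectionsSketch {
    enum Direction { REMOTE_STORE, DOCREP }

    // Returns a reason when the replica allocation should be blocked, or empty when it is allowed.
    static Optional<String> replicaBlockReason(Direction direction, boolean primaryOnRemote, boolean targetIsRemote) {
        switch (direction) {
            case REMOTE_STORE:
                // Rule from the diff: no replica on a remote node before its primary is on one.
                if (!primaryOnRemote && targetIsRemote) {
                    return Optional.of("primary shard copy is not yet migrated to remote");
                }
                return Optional.empty();
            case DOCREP:
                // ASSUMPTION: no replica on a docrep node before its primary has moved off remote.
                if (primaryOnRemote && !targetIsRemote) {
                    return Optional.of("primary shard copy is not yet migrated to docrep");
                }
                return Optional.empty();
            default:
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(replicaBlockReason(Direction.REMOTE_STORE, false, true).isPresent());
        System.out.println(replicaBlockReason(Direction.DOCREP, true, false).isPresent());
    }
}
```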
