
[ML] prefer least allocated model when a new node is added to the cluster #77756

Conversation

benwtrent (Member)

When a new node is added to the cluster, we should first attempt to allocate the models that currently have the fewest allocated nodes.

This way we prioritize getting allocations up and running for every model, instead of relying on an arbitrary selection order.
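As a rough illustration of that ordering (a minimal sketch using simplified stand-in types, not the actual Elasticsearch classes), the idea is to visit the non-stopping model allocations in ascending order of how many nodes they are already routed to:

```java
import java.util.*;
import java.util.stream.*;

// Simplified stand-in for the real trained-model allocation records; names are illustrative only.
record ModelAllocation(String modelId, Set<String> routedNodes, boolean stopping) {}

class LeastAllocatedFirst {
    // Visit models with the fewest current allocations first, so every deployment gets a
    // chance at the new node's capacity before any model receives an additional allocation.
    static List<ModelAllocation> allocationOrder(Collection<ModelAllocation> allocations) {
        return allocations.stream()
            .filter(a -> a.stopping() == false)
            .sorted(Comparator.comparingInt(a -> a.routedNodes().size()))
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<ModelAllocation> allocations = List.of(
            new ModelAllocation("model-a", Set.of("node-1", "node-2"), false),
            new ModelAllocation("model-b", Set.of(), false),
            new ModelAllocation("model-c", Set.of("node-1"), false)
        );
        // Prints model-b, model-c, model-a: zero allocations first, most allocated last.
        allocationOrder(allocations).forEach(a -> System.out.println(a.modelId()));
    }
}
```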

@elasticmachine added the Team:ML (Meta label for the ML team) label on Sep 15, 2021
@elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

.entrySet()
.stream()
.filter(entry -> entry.getValue().getAllocationState().equals(AllocationState.STOPPING) == false)
.sorted(Comparator.comparing(e -> e.getValue().getNodeRoutingTable().size()))
benwtrent (Member Author)

This is obviously not optimal, but I am not 100% sure how else to do it.

One issue here (besides performance) is that we always use the counts from the original cluster state. Meaning, as models are allocated, our iteration order isn't updated within the same pass.

This is probably acceptable.

Another option would be to iterate over the nodes in the outer loop, so that all allocations attempt the same node before we move on. The only downside is that the order of the inner iteration could change all the time.

Another option is attempting the same thing with a priority queue that updates on each iteration, popping off the least-allocated models and rebuilding each time, but this seems pretty expensive to me, as the whole heap would be rebuilt on every pass over a new node...
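For illustration only (this is not what the PR implements, and all names below are hypothetical), one way such a queue could be maintained is to poll the least-allocated model and re-insert it with an updated count, so the ordering reflects assignments already made in the same pass:

```java
import java.util.*;

// Hypothetical sketch of the priority-queue alternative: models with the fewest
// current allocations are polled first, and re-offered after each assignment so
// later picks see the updated ordering.
class AllocationQueueSketch {
    record ModelState(String modelId, int allocatedNodes) {}

    public static void main(String[] args) {
        PriorityQueue<ModelState> queue =
            new PriorityQueue<>(Comparator.comparingInt(ModelState::allocatedNodes));
        queue.add(new ModelState("model-a", 2));
        queue.add(new ModelState("model-b", 0));
        queue.add(new ModelState("model-c", 1));

        // Simulate a new node with capacity for two model assignments.
        int remainingCapacity = 2;
        while (remainingCapacity > 0 && queue.isEmpty() == false) {
            ModelState least = queue.poll();      // model with the fewest allocations
            remainingCapacity--;
            // Re-insert with the updated count so the heap stays consistent within the pass.
            queue.add(new ModelState(least.modelId(), least.allocatedNodes() + 1));
        }
        // In this toy run, model-b and model-c each receive one allocation.
    }
}
```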

Contributor

Low-hanging fruit here would be to apply the StartTrainedModelDeploymentAction.TaskParams.mayAllocateToNode filter at the point where we collect currentNotShuttingDownNodes. That collection would become currentEligibleNodes, for instance, containing only non-shutting-down ML nodes of an appropriate version. This would significantly reduce one of the parameters here for most clusters.

I can't think of anything else from a first look. But if we make this optimization, we can then run a few tests to see how long this takes with realistic parameter values. It might not be a problem at all.
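Sketched roughly (the surrounding variables allNodes, shuttingDownNodes, and params are assumed to be in scope, and mayAllocateToNode is assumed to accept a DiscoveryNode; this is an illustration of the suggestion, not the final code), the eligible-node collection could look something like:

```java
// Rough sketch: filter once up front so the per-model loop only ever iterates
// ML nodes of a compatible version that are not shutting down.
List<DiscoveryNode> currentEligibleNodes = allNodes.stream()
    .filter(node -> shuttingDownNodes.contains(node.getId()) == false) // exclude shutting-down nodes
    .filter(params::mayAllocateToNode)                                 // ML role and version check
    .collect(Collectors.toList());
```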

@@ -289,16 +289,8 @@ static ClusterState updateModelRoutingTable(ClusterState currentState, UpdateTra
logger.trace(
() -> new ParameterizedMessage("[{}] [{}] current metadata before update {}", modelId, nodeId, Strings.toString(metadata))
);
Set<String> shuttingDownNodes = nodesShuttingDown(currentState);
benwtrent (Member Author)

This is old, unused code; I decided to clean it up. It is unrelated to the change in this PR.

StartTrainedModelDeploymentAction.TaskParams params,
DiscoveryNode node
) {
NodeLoad load = builder.isChanged()
Member

This behaviour might be surprising or cause a subtle bug: if you pass the builder (this is an overloaded method), you expect it to be used. I can't think of a reason not to call builder.build(); did you have something in mind?

benwtrent (Member Author)

> I can't think of a reason not to call builder.build(); did you have something in mind?

It just seems unnecessary if nothing changed. The object could be pretty big, and constructing a brand-new object (deeply copying all the hash tables) could be expensive when a node is simply leaving the cluster.

Member

It works fine in this case because we know it is ok to fall back on the TrainedModelAllocationMetadata in the ClusterState, but given a few refactors and a little bit of time, that assumption may no longer hold and could lead to unexpected behaviour.

Perhaps it is not as neat, but you could test builder.isChanged() outside this function and then call the correct overload, or at least add a comment.

benwtrent (Member Author)

> if you pass the builder (this is an overloaded method), you expect it to be used

I suppose it is a sneaky switch. I will update it to be clearer so other callers of the method aren't surprised.
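For instance, a sketch of the reviewer's suggestion rather than the PR's final code (the no-builder overload is assumed to exist, since the method is described as overloaded): make the choice visible at the call site instead of silently falling back inside the method.

```java
// Sketch: decide explicitly which allocation data the capacity check reads,
// rather than having the builder-accepting overload ignore an unchanged builder.
Optional<String> failure = builder.isChanged()
    ? nodeHasCapacity(currentState, builder, params, node) // builder holds allocations updated in this pass
    : nodeHasCapacity(currentState, params, node);          // nothing changed; read from cluster state
```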

&& modelAllocationEntry.getValue().isRoutedToNode(node.getId()) == false) {
Optional<String> failure = nodeHasCapacity(
currentState,
builder,
Member

Good catch using the updated allocations here.

davidkyle (Member) left a comment

LGTM

@benwtrent (Member Author)

run elasticsearch-ci/packaging-tests-windows-sample

dimitris-athanasiou (Contributor) left a comment

LGTM2

@benwtrent added the auto-merge-without-approval (Automatically merge pull request when CI checks pass; NB doesn't wait for reviews!) label on Sep 16, 2021
@benwtrent merged commit fec5e8a into elastic:master on Sep 16, 2021
@benwtrent deleted the feature/ml-allocation-prefers-least-allocated-on-new-node branch on Sep 16, 2021, 12:48
wjp719 added a commit to wjp719/elasticsearch that referenced this pull request Sep 18, 2021
* master: (185 commits)
  Implement get and containsKey in terms of the wrapped innerMap (elastic#77965)
  Adjust Lucene version and enable BWC tests (elastic#77933)
  Disable BWC to upgrade to Lucene-8.10-snapshot
  Reenable MlDistributedFailureIT
  [DOCS] Fix typo for `script.painless.regex.enabled` setting value (elastic#77853)
  Upgrade to Lucene-8.10.0-snapshot-bf2fcb53079 (elastic#77801)
  [DOCS] Fix ESS install lead-in (elastic#77887)
  Resolve thirdparty gradle plugin artifacts from mavencentral (elastic#77865)
  Reduce the number of times that `LifecycleExecutionState` is parsed when running a policy. (elastic#77863)
  Utility methods to add and remove backing indices from data streams (elastic#77778)
  Use Objects.equals() instead of == to compare strings (elastic#77840)
  [ML] prefer least allocated model when a new node is added to the cluster (elastic#77756)
  Deprecate ignore_throttled parameter (elastic#77479)
  Improve LifecycleExecutionState parsing. (elastic#77855)
  [DOCS] Removes deprecated word from HLRC title. (elastic#77851)
  Remove legacy geo code from AggregationResultUtils (elastic#77702)
  Adjust SearchableSnapshotsBlobStoreCacheIntegTests.testBlobStoreCache (elastic#77758)
  Laxify SecureSM to allow creation of the JDK's innocuous threads (elastic#77789)
  [Test] Reduce concurrency when testing creation of security index (elastic#75293)
  Refactor metric PipelineAggregation integration test (elastic#77548)
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/cluster/metadata/DataStream.java
Labels
auto-merge-without-approval, :ml (Machine learning), >non-issue, Team:ML, v8.0.0-beta1