[ML] prefer least allocated model when a new node is added to the cluster #77756
Conversation
Pinging @elastic/ml-core (Team:ML)
    .entrySet()
    .stream()
    .filter(entry -> entry.getValue().getAllocationState().equals(AllocationState.STOPPING) == false)
    .sorted(Comparator.comparing(e -> e.getValue().getNodeRoutingTable().size()))
This is obviously not optimal, but I am not 100% sure how else to do it.
One issue here (besides performance) is that we always use the counts based on the original cluster state. That means that as models are allocated, our iteration order isn't updated within the same pass. This is probably acceptable.
Another option is to iterate over nodes in the outer loop. That way all allocations attempt the same node before we move on. The only downside is that the order of the inner iteration may change all the time.
Yet another option is to attempt the same thing with a priority queue that updates on each iteration, popping off the least-allocated models and rebuilding each time (see the sketch below). But that seems pretty expensive to me, as the whole heap would be rebuilt on every pass for a new node...
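For illustration, a minimal self-contained sketch of the priority-queue alternative; all names here (ModelCount, LeastAllocatedQueue) are hypothetical stand-ins, not the Elasticsearch code. Re-inserting a model with its updated count keeps the ordering current within a pass, at the cost of heap operations per allocation, and a full rebuild whenever counts must be recomputed from scratch.

import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical stand-in for a model and its current allocation count.
record ModelCount(String modelId, int allocatedNodes) {}

class LeastAllocatedQueue {
    public static void main(String[] args) {
        // Least-allocated model comes out first.
        PriorityQueue<ModelCount> queue =
            new PriorityQueue<>(Comparator.comparingInt(ModelCount::allocatedNodes));
        queue.add(new ModelCount("model-a", 3));
        queue.add(new ModelCount("model-b", 0));
        queue.add(new ModelCount("model-c", 2));

        // Allocate the least-allocated model to the new node, then re-insert it
        // with its updated count so the next pick reflects this pass's changes.
        ModelCount least = queue.poll();
        queue.add(new ModelCount(least.modelId(), least.allocatedNodes() + 1));

        System.out.println(queue.peek()); // model-b again, now with 1 < 2 allocations
    }
}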
Low-hanging fruit here is to apply the StartTrainedModelDeploymentAction.TaskParams.mayAllocateToNode filter at the point where we collect currentNotShuttingDownNodes. That collection would then become currentEligibleNodes, for instance, and would contain only non-shutting-down ML nodes of an appropriate version. This would significantly reduce one of the parameters here for most clusters (see the sketch below).
I can't think of anything else on first look. But if we make this optimization, we can then run a few tests to see how long this takes with realistic parameter values. It might not be a problem at all.
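To make the suggestion concrete, a minimal self-contained sketch of collecting eligible nodes up front. The Node record and the version threshold are simplified stand-ins invented for illustration; the real predicate is the mayAllocateToNode check mentioned above.

import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Simplified stand-in for DiscoveryNode; the real filter would be
// StartTrainedModelDeploymentAction.TaskParams.mayAllocateToNode.
record Node(String id, boolean isMlNode, int major) {}

class EligibleNodes {
    static final int MIN_SUPPORTED_MAJOR = 7; // assumed threshold, for illustration only

    // currentNotShuttingDownNodes effectively becomes currentEligibleNodes:
    // non-shutting-down ML nodes of an appropriate version.
    static List<Node> currentEligibleNodes(List<Node> allNodes, Set<String> shuttingDownIds) {
        return allNodes.stream()
            .filter(n -> shuttingDownIds.contains(n.id()) == false)
            .filter(Node::isMlNode)
            .filter(n -> n.major() >= MIN_SUPPORTED_MAJOR)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
            new Node("node-1", true, 8),
            new Node("node-2", false, 8),  // not an ML node
            new Node("node-3", true, 6));  // version too old
        System.out.println(currentEligibleNodes(nodes, Set.of())); // only node-1 survives
    }
}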
@@ -289,16 +289,8 @@ static ClusterState updateModelRoutingTable(ClusterState currentState, UpdateTra
        logger.trace(
            () -> new ParameterizedMessage("[{}] [{}] current metadata before update {}", modelId, nodeId, Strings.toString(metadata))
        );
-       Set<String> shuttingDownNodes = nodesShuttingDown(currentState);
This is old, unused code; I decided to clean it up. It's unrelated to the change in this PR.
    StartTrainedModelDeploymentAction.TaskParams params,
    DiscoveryNode node
) {
    NodeLoad load = builder.isChanged()
This behaviour might be surprising or cause a subtle bug: if you pass the builder (this is an overloaded method), you expect it to be used. I can't think of a reason not to call builder.build(). Did you have something in mind?
> I can't think of a reason not to call builder.build(). Did you have something in mind?

It just seems unnecessary if nothing changed. The object could be pretty big, and constructing a brand-new object (deeply copying all the hashtables) could be expensive in the case where a node is simply leaving the cluster.
It works fine in this case because we know it is OK to fall back on the TrainedModelAllocationMetadata in the ClusterState, but given a few refactors and a little bit of time that assumption may no longer hold and could lead to unexpected behaviour.
Perhaps it is not as neat, but you could test builder.isChanged() outside this function and then call the correct overload, or at least add a comment.
> if you pass the builder (this is an overloaded method), you expect it to be used

I suppose it is a sneaky switch. I will update it to be clearer so other callers of the method aren't surprised (see the sketch below).
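A minimal, self-contained illustration of the suggested shape; every name here is a hypothetical stand-in, not the actual Elasticsearch API. Checking isChanged() at the call site makes the choice of metadata explicit, and still avoids the deep-copying build() when nothing changed.

import java.util.Optional;

class AllocationCheckSketch {
    // Hypothetical stand-in for TrainedModelAllocationMetadata.
    record Metadata(int totalAllocations) {}

    static class Builder {
        private int totalAllocations;
        private boolean changed;
        void addAllocation() { totalAllocations++; changed = true; }
        boolean isChanged() { return changed; }
        Metadata build() { return new Metadata(totalAllocations); } // imagine an expensive deep copy
    }

    // Overload used when nothing changed: falls back to the metadata already in cluster state.
    static Optional<String> nodeHasCapacity(Metadata fromClusterState) {
        return fromClusterState.totalAllocations() < 10 ? Optional.empty() : Optional.of("node is full");
    }

    // Overload used when the builder holds updated allocations; the extra argument is always used.
    static Optional<String> nodeHasCapacity(Metadata fromClusterState, Metadata updated) {
        return updated.totalAllocations() < 10 ? Optional.empty() : Optional.of("node is full");
    }

    static Optional<String> check(Metadata fromClusterState, Builder builder) {
        // The caller decides which overload applies, so neither overload silently ignores its argument.
        return builder.isChanged()
            ? nodeHasCapacity(fromClusterState, builder.build())
            : nodeHasCapacity(fromClusterState);
    }

    public static void main(String[] args) {
        Builder builder = new Builder();
        Metadata current = new Metadata(4);
        System.out.println(check(current, builder)); // unchanged: uses the cluster-state overload
        builder.addAllocation();
        System.out.println(check(current, builder)); // changed: builds and uses updated metadata
    }
}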
        && modelAllocationEntry.getValue().isRoutedToNode(node.getId()) == false) {
        Optional<String> failure = nodeHasCapacity(
            currentState,
            builder,
good catch using the updated allocations here
…tion-prefers-least-allocated-on-new-node
LGTM
run elasticsearch-ci/packaging-tests-windows-sample
LGTM2
* master: (185 commits)
  Implement get and containsKey in terms of the wrapped innerMap (elastic#77965)
  Adjust Lucene version and enable BWC tests (elastic#77933)
  Disable BWC to upgrade to Lucene-8.10-snapshot
  Reenable MlDistributedFailureIT
  [DOCS] Fix typo for `script.painless.regex.enabled` setting value (elastic#77853)
  Upgrade to Lucene-8.10.0-snapshot-bf2fcb53079 (elastic#77801)
  [DOCS] Fix ESS install lead-in (elastic#77887)
  Resolve thirdparty gradle plugin artifacts from mavencentral (elastic#77865)
  Reduce the number of times that `LifecycleExecutionState` is parsed when running a policy. (elastic#77863)
  Utility methods to add and remove backing indices from data streams (elastic#77778)
  Use Objects.equals() instead of == to compare strings (elastic#77840)
  [ML] prefer least allocated model when a new node is added to the cluster (elastic#77756)
  Deprecate ignore_throttled parameter (elastic#77479)
  Improve LifecycleExecutionState parsing. (elastic#77855)
  [DOCS] Removes deprecated word from HLRC title. (elastic#77851)
  Remove legacy geo code from AggregationResultUtils (elastic#77702)
  Adjust SearchableSnapshotsBlobStoreCacheIntegTests.testBlobStoreCache (elastic#77758)
  Laxify SecureSM to allow creation of the JDK's innocuous threads (elastic#77789)
  [Test] Reduce concurrency when testing creation of security index (elastic#75293)
  Refactor metric PipelineAggregation integration test (elastic#77548)
  ...

# Conflicts:
#	server/src/main/java/org/elasticsearch/cluster/metadata/DataStream.java
When a new node is added to the cluster, we should first attempt to allocate the models that have the fewest allocated nodes. This way we prioritize getting under-allocated models up and running instead of picking models in an arbitrary order.
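A rough, self-contained sketch of the resulting ordering (types simplified; the real code sorts TrainedModelAllocation entries by routing-table size): models with fewer allocated nodes are visited first.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class LeastAllocatedFirst {
    public static void main(String[] args) {
        // Hypothetical model -> number of nodes it is currently routed to.
        Map<String, Integer> routedNodeCounts = new LinkedHashMap<>();
        routedNodeCounts.put("model-a", 3);
        routedNodeCounts.put("model-b", 0);
        routedNodeCounts.put("model-c", 1);

        // Visit models with the fewest allocated nodes first, mirroring the
        // Comparator on getNodeRoutingTable().size() in the change above.
        List<String> visitOrder = routedNodeCounts.entrySet()
            .stream()
            .sorted(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());

        System.out.println(visitOrder); // [model-b, model-c, model-a]
    }
}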