Platform specific models #99584

Merged
merged 94 commits into from
Sep 28, 2023

maxhniebergall
Member

Adding support for platform specific models

@maxhniebergall maxhniebergall added cloud-deploy Publish cloud docker image for Cloud-First-Testing >enhancement :ml Machine learning labels Sep 14, 2023
@maxhniebergall maxhniebergall self-assigned this Sep 14, 2023
@elasticsearchmachine
Collaborator

Hi @maxhniebergall, I've created a changelog YAML for you.

maxhniebergall and others added 2 commits September 19, 2023 10:53
A few bug fixes from Dave R

Co-authored-by: David Roberts <dave.roberts@elastic.co>
@maxhniebergall
Member Author

The failure in elasticsearch-ci/part-2 seems to be unrelated. I was able to reproduce it on my local machine with the reproduce command, but not without the particular parameters. I will create an issue for this.

REPRODUCE WITH: ./gradlew ':x-pack:plugin:esql:compute:test' --tests "org.elasticsearch.compute.operator.ProjectOperatorTests.testProjection" -Dtests.seed=D7AC53920B72C687 -Dtests.locale=sk -Dtests.timezone=Europe/Samara -Druntime.java=21	

org.elasticsearch.compute.operator.ProjectOperatorTests > testProjection FAILED	
    java.lang.AssertionError: java.lang.IllegalStateException: can't release already released block [IntVectorBlock[vector=ConstantIntVector[positions=5, value=3]]]

@maxhniebergall
Member Author

https://gradle-enterprise.elastic.co/s/arrn2n6mkbcgc

In the stack trace:
1> java.lang.IllegalStateException: Future got interrupted
org.elasticsearch.xpack.ml.inference.assignment.TrainedModelAssignmentNodeService.loadQueuedModels(TrainedModelAssignmentNodeService.java:210) ~[main/:?]

on that line:

deploymentManager.startDeployment(loadingTask, listener);                
TrainedModelDeploymentTask deployedTask = listener.actionGet();

due to
2> org.elasticsearch.ElasticsearchStatusException: Starting deployment timed out after [30s]
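The pattern in the two lines quoted above, starting an asynchronous deployment and then blocking on the listener, can be illustrated with plain JDK futures. This is a minimal sketch, assuming `CompletableFuture` stands in for the Elasticsearch listener type; `BlockingStartSketch` and `awaitDeployment` are hypothetical names and are not part of the codebase.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only: not the TrainedModelAssignmentNodeService code.
final class BlockingStartSketch {

    private BlockingStartSketch() {}

    // Blocks the calling thread until the deployment future completes,
    // the timeout elapses, or the thread is interrupted.
    static String awaitDeployment(CompletableFuture<String> deployment, long timeoutMillis) {
        try {
            return deployment.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The start did not finish in time; surface it as a failure.
            throw new IllegalStateException(
                "Starting deployment timed out after [" + timeoutMillis + "ms]", e);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it.
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Future got interrupted", e);
        } catch (ExecutionException e) {
            // The asynchronous start itself failed.
            throw new IllegalStateException("Deployment failed", e.getCause());
        }
    }
}
```

A blocking wait like this turns a slow start into either a timeout or an interruption on the waiting thread, which is consistent with the two exceptions in the stack trace.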

…ously

Co-authored-by: David Roberts <dave.roberts@elastic.co>
Contributor

@droberts195 droberts195 left a comment

LGTM

Now that this is passing CI, let's get it merged unless somebody else can see something really bad.

Any nits can be resolved in a followup PR.

One such followup should be removing the hack that looks at the model name to determine if it's Linux x86.

@maxhniebergall
Member Author

I confirmed with manual testing that the warning header shows up on put trained model, and that the backend refuses to start platform-specific models.

@maxhniebergall maxhniebergall merged commit 7c21ce3 into elastic:main Sep 28, 2023
piergm pushed a commit to piergm/elasticsearch that referenced this pull request Oct 2, 2023
* Added platform architecture field to TrainedModelMetadata and users of TrainedModelMetadata

* Added TransportVersions guarding for TrainedModelMetadata

* Prevent platform-specific models from being deployed on the wrong architecture

* Added logic to only verify node architectures for models which are platform specific

* Handle null platform architecture

* Added logging for the detection of heterogeneous platform architectures among ML nodes and refactoring to support this

* Added platform architecture field to TrainedModelConfig

* Stop platform-specific model when rebalance occurs and the cluster has a heterogeneous architecture among ML nodes

* Added logic to TransportPutTrainedModelAction to return a warning response header when the model is platform-specific and cannot be deployed on the cluster at that time due to heterogeneous architectures among ML nodes

* Added MlPlatformArchitecturesUtilTests

* Updated Create Trained Models API docs to describe the new platform_architecture optional field.

* Updated/incremented InferenceIndexConstants

* Added a special override so that models with linux-x86_64 in the model ID are treated as platform specific
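
The checks described in the list above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code from this PR: the class and method names (`PlatformArchitectureSketch`, `effectiveArchitecture`, `canDeploy`) are hypothetical, the architecture strings are sample values, and treating an empty set of ML node architectures as non-deployable is an assumption.

```java
import java.util.Set;

// Hypothetical illustration only: names do not correspond to Elasticsearch classes.
final class PlatformArchitectureSketch {

    private PlatformArchitectureSketch() {}

    // Assumption: the temporary override keys on this substring of the model ID.
    static final String LINUX_X86_MARKER = "linux-x86_64";

    // Returns the architecture a model is tied to, or null if it is platform agnostic.
    static String effectiveArchitecture(String modelId, String declaredArchitecture) {
        if (declaredArchitecture != null) {
            return declaredArchitecture;
        }
        // Override: infer the platform from the model ID when nothing is declared.
        return modelId.contains(LINUX_X86_MARKER) ? LINUX_X86_MARKER : null;
    }

    // A platform-agnostic model can run anywhere. A platform-specific model
    // needs all ML nodes to share exactly its architecture.
    static boolean canDeploy(String modelArchitecture, Set<String> mlNodeArchitectures) {
        if (modelArchitecture == null) {
            return true; // only platform-specific models are verified
        }
        boolean homogeneous = mlNodeArchitectures.size() == 1;
        return homogeneous && mlNodeArchitectures.contains(modelArchitecture);
    }
}
```

In this sketch, a model resolved to `linux-x86_64` is rejected as soon as the set of ML node architectures contains a second value, which mirrors the heterogeneous-cluster case the warning header is meant to flag.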
jakelandis pushed a commit to jakelandis/elasticsearch that referenced this pull request Oct 2, 2023
darnautov added a commit to elastic/kibana that referenced this pull request Oct 3, 2023
## Summary

Adds support for ELSER v2 download from the Trained Models UI.

- Marks an appropriate model version for the current cluster
configuration with the recommended flag.
- Updates the state column with better human-readable labels and colour
indicators.
- Adds a callout promoting a new version of ELSER

<img width="1686" alt="image"
src="https://github.com/elastic/kibana/assets/5236598/0deea53a-6d37-4af6-97bc-9f46e36f113b">

#### Notes for reviews
- We need to wait for
elastic/elasticsearch#99584 to get the start
deployment validation functionality. At the moment you can successfully
start deployment of the wrong model version.

### Checklist

- [x] Any text added follows [EUI's writing
guidelines](https://elastic.github.io/eui/#/guidelines/writing), uses
sentence case text and includes [i18n
support](https://github.com/elastic/kibana/blob/main/packages/kbn-i18n/README.md)
- [ ]
[Documentation](https://www.elastic.co/guide/en/kibana/master/development-documentation.html)
was added for features that require explanation or tutorials
- [x] [Unit or functional
tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html)
were updated or added to match the most common scenarios
- [x] Any UI touched in this PR is usable by keyboard only (learn more
about [keyboard accessibility](https://webaim.org/techniques/keyboard/))
- [x] Any UI touched in this PR does not create any new axe failures
(run axe in browser:
[FF](https://addons.mozilla.org/en-US/firefox/addon/axe-devtools/),
[Chrome](https://chrome.google.com/webstore/detail/axe-web-accessibility-tes/lhdoppojpmngadmnindnejefpokejbdd?hl=
- [x] This renders correctly on smaller devices using a responsive
layout. (You can test this [in your
browser](https://www.browserstack.com/guide/responsive-testing-on-local-server))
- [x] This was checked for [cross-browser
compatibility](https://www.elastic.co/support/matrix#matrix_browsers)
droberts195 added a commit to droberts195/elasticsearch that referenced this pull request Oct 3, 2023
Adds the new platform_architecture field from elastic#99584
to the package config used when downloading Elastic
models from GCS.
droberts195 added a commit that referenced this pull request Oct 3, 2023
Adds the new platform_architecture field from #99584
to the package config used when downloading Elastic
models from GCS.
Labels
cloud-deploy Publish cloud docker image for Cloud-First-Testing >enhancement :ml Machine learning Team:ML Meta label for the ML team v8.11.0

4 participants