
[ML] Investigate alternative methods for sharing job memory usage information #34084

Closed
davidkyle opened this issue Sep 26, 2018 · 5 comments

Labels: >enhancement, :ml Machine learning

Comments

@davidkyle
Member

When there are multiple ML nodes in the cluster, the job allocation decision is made based on the number of open jobs on each node and how much memory they use. Job memory usage is stored in the job configuration and is updated periodically during the job's run, whenever a model size stats document is emitted by autodetect. This can lead to frequent job config updates (i.e. cluster state updates), particularly for historical look-back jobs.

  1. Consider moving the job's established memory usage out of the config, as it is a result of the job running, not part of its setup.
  2. Consider alternative methods to gather the open jobs' memory usage and make that information trivially available to the code making the allocation decision.

This is pertinent to the job config migration project #32905, where the job's memory usage is not available in the cluster state during the allocation decision. A temporary workaround was implemented in #33994, basing the decision on job count rather than memory usage.

@elasticmachine
Collaborator

Pinging @elastic/ml-core

@droberts195
Contributor

I agree that, since we can't access the job config document from within the allocation decision, there's no longer any point in storing established model memory there, so it should be removed (though still tolerated by the job parser for BWC purposes until 8.0). I only put it in the job config in the first place because it needed to be available to the allocation decision and the job task status had a strict parser.
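For the BWC part, a minimal sketch of what "tolerate but ignore" could look like, assuming an ObjectParser-based job parser; the class, parser and field names here are illustrative, not the actual ones in the plugin:

```java
import org.elasticsearch.common.ParseField;
import org.elasticsearch.common.xcontent.ObjectParser;

// Hypothetical sketch only: keep accepting established_model_memory when parsing
// an old job config, but discard the value instead of storing it anywhere.
public class JobConfigBwcSketch {
    static final ParseField ESTABLISHED_MODEL_MEMORY = new ParseField("established_model_memory");
    static final ObjectParser<JobConfigBwcSketch, Void> PARSER =
        new ObjectParser<>("job_config_sketch", JobConfigBwcSketch::new);

    static {
        // Parse and ignore: tolerated until 8.0, never written back out.
        PARSER.declareLong((job, value) -> { /* ignored for BWC */ }, ESTABLISHED_MODEL_MEMORY);
    }
}
```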

Core Elasticsearch has a similar problem for allocating shards to nodes. The master node needs an up-to-date view of how much disk space each data node currently has. This problem is solved by the InternalClusterInfoService class. I think we should add a similar class to the ML plugin, say MlClusterInfoService, keeping the name generic as one day we might want to collect something other than memory information. (We can't easily extend InternalClusterInfoService as it's not part of X-Pack.)
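To make the idea concrete, a purely illustrative sketch of the shape such a service could take; the class and method names are assumptions, not existing code:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a master-node-local registry of per-job memory requirements.
public class MlClusterInfoService {

    // job_id -> memory requirement in bytes, refreshed periodically
    private final Map<String, Long> memoryRequirementByJob = new ConcurrentHashMap<>();

    /** Called when a job opens, or when a periodic refresh finds a newer value. */
    public void setMemoryRequirement(String jobId, long bytes) {
        memoryRequirementByJob.put(jobId, bytes);
    }

    /** Used by the allocation decision; null means "not known yet". */
    public Long getMemoryRequirement(String jobId) {
        return memoryRequirementByJob.get(jobId);
    }

    /** Called when a job closes or is deleted. */
    public void removeJob(String jobId) {
        memoryRequirementByJob.remove(jobId);
    }
}
```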

Once the established model memory for each job is held in MlClusterInfoService we don't need to record it in any documents, as the definition is either analysis_limits.model_memory_limit or the model_size_bytes from the most recent model_size_stats document, and both of these numbers can easily be obtained by anyone who wants to know them.
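A rough sketch of the model_size_stats half of that lookup, i.e. asking for the single most recent result for a job; the index pattern and field names are assumptions based on the public ML results format rather than the plugin's constants:

```java
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.SortOrder;

// Build a search for the latest model_size_stats result of one job.
public class LatestModelSizeStatsSearch {

    public static SearchRequest build(String jobId) {
        SearchSourceBuilder source = new SearchSourceBuilder()
            .query(QueryBuilders.boolQuery()
                .filter(QueryBuilders.termQuery("job_id", jobId))
                .filter(QueryBuilders.termQuery("result_type", "model_size_stats")))
            .sort("log_time", SortOrder.DESC)   // assumed timestamp field on model_size_stats
            .size(1);
        return new SearchRequest(".ml-anomalies-*").source(source);
    }
}
```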

@droberts195
Contributor

I'll have a go at implementing the idea from #34084 (comment)

@droberts195
Contributor

I had a closer look at InternalClusterInfoService and we don't actually need our service to be as complex. InternalClusterInfoService has to make a request to every node periodically to get its latest disk space, but our ML task memory service doesn't need to communicate with every node at all: it can just periodically kick off async searches for the relevant model_size_stats documents from the master node.
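Something along these lines for the periodic part, assuming the service holds a ThreadPool reference; the interval, method names and master-election handling are illustrative only:

```java
import org.elasticsearch.common.unit.TimeValue;
import org.elasticsearch.threadpool.ThreadPool;

// Sketch of scheduling the periodic refresh on the master node.
class MemoryRefreshSchedulerSketch {

    MemoryRefreshSchedulerSketch(ThreadPool threadPool) {
        // Only the elected master needs to do this; a real implementation would
        // start and stop the schedule when the local node gains or loses mastership.
        threadPool.scheduleWithFixedDelay(this::refresh, TimeValue.timeValueMinutes(1), ThreadPool.Names.GENERIC);
    }

    private void refresh() {
        // For each ML persistent task that needs native memory, kick off the
        // async model_size_stats search and update the in-memory map on response.
    }
}
```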

The process for making sure the ML task memory service has a reasonable value for each active ML task can be:

  1. Each job registers its memory requirement with the ML task memory service on opening. There is already code that gets the established model memory on opening a job created in 6.1 or earlier and sends the value found to an UpdateJobAction. This can be changed to always run on opening any job and instead send the value to a new master node action that updates the ML task memory service.
  2. Periodically iterate over all ML persistent tasks that require native memory and run the search that finds the latest memory requirement.

Then TransportOpenJobAction.OpenJobPersistentTasksExecutor will have a reference to this new ML task memory service, so it will be able to use that information when allocating jobs.
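Illustratively, reusing the MlClusterInfoService sketch from the earlier comment (the real executor does considerably more work when picking a node):

```java
// Sketch of the allocation-side lookup with a fallback until stats are seen.
class AllocationSketch {

    private final MlClusterInfoService mlClusterInfoService;

    AllocationSketch(MlClusterInfoService mlClusterInfoService) {
        this.mlClusterInfoService = mlClusterInfoService;
    }

    /** Memory to budget for a job, falling back to its configured limit until a
     *  model_size_stats value has been observed. */
    long memoryRequiredBytes(String jobId, long configuredLimitBytes) {
        Long known = mlClusterInfoService.getMemoryRequirement(jobId);
        return known != null ? known : configuredLimitBytes;
    }
}
```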

@droberts195
Contributor

Fixed by #36069
