Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Use a new ML endpoint to estimate a model memory #60376

Merged
merged 18 commits into from
Mar 19, 2020

Conversation

darnautov
Copy link
Contributor

@darnautov darnautov commented Mar 17, 2020

Summary

Part of #60386.

  • Update calculate_model_memory_limit kibana endpoint to use the ML _estimate_model_memory API to retrieve a model memory estimation.
  • Update validate_model_memory_limit to work with new calculate_model_memory_limit
  • TS refactoring

Checklist

@elasticmachine
Copy link
Contributor

Pinging @elastic/ml-ui (:ml)

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The overall structure looks fine to me.

I left a few comments about how the ES endpoint is being called. (Since I don't know anything about JavaScript or TypeScript someone else needs to review from that perspective.)

@darnautov darnautov requested a review from droberts195 March 18, 2020 10:27
Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates Dima. I cannot see anything wrong with this now, but please wait for a review from someone who knows the Kibana code well before merging.

indexPattern: string,
query: any,
timeFieldName: string,
earliestMs: number,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note that in the follow-up bit of work, where the modules setup endpoint calls this calculateModelMemoryLimit function, that we need to handle the case where earliest and latest times are not supplied. In this case, the setup endpoint should find the start/end times of the latest 3 months of data, and pass those times to this function.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the heads-up, yes, I planned this changes for the follow-up PR dedicated to the setup endpoint

@peteharverson
Copy link
Contributor

I tested the multi metric wizard, with the time range set to one where there is no data in the index, and the call to the calculate_model_memory_limit endpoint is generating an HTTP 500 interval server error:

{"statusCode":500,"error":"Internal Server Error","message":"Unable to retrieve model memory estimation"}

This looks like it happens when you add an influencer which isn't the partitioning field.

@darnautov
Copy link
Contributor Author

I tested the multi metric wizard, with the time range set to one where there is no data in the index, and the call to the calculate_model_memory_limit endpoint is generating an HTTP 500 interval server error:

{"statusCode":500,"error":"Internal Server Error","message":"Unable to retrieve model memory estimation"}

This looks like it happens when you add an influencer which isn't the partitioning field.

good catch, fixed in 7bdf515

@jgowdyelastic
Copy link
Member

I'm seeing errors when trying to calculate the mml when i have a max mml set in the es config.
i have set it to 128mb:

xpack.ml.max_model_memory_limit: 128mb

@darnautov
Copy link
Contributor Author

I'm seeing errors when trying to calculate the mml when i have a max mml set in the es config.
i have set it to 128mb:

xpack.ml.max_model_memory_limit: 128mb

Thanks for checking this. I used ts-ignore to silent a type issue from numeral and it ignored the actual error left after refactoring. Fixed in a85a4d6

Copy link
Member

@jgowdyelastic jgowdyelastic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

* Service for carrying out queries to obtain data
* specific to fields in Elasticsearch indices.
*/
export function fieldsServiceProvider(callAsCurrentUser: APICaller) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Building a multi-metric job against the gallery data set, using the full time range, and a small bucket span, I am seeing circuit_breaking_exceptions in the network tab of the dev console as I add extra influencers.

{"statusCode":429,"error":"Too Many Requests","message":"[circuit_breaking_exception] [parent] Data too large, data for [<reused_arrays>] would be [988102840/942.3mb], which is larger than the limit of [986932838/941.2mb], real usage: [988102768/942.3mb], new bytes reserved: [72/72b], usages [request=147475312/140.6mb, fielddata=228074/222.7kb, accounting=1282468/1.2mb, inflight_requests=784/784b], with { bytes_wanted=988102840 & bytes_limit=986932838 & durability=\"TRANSIENT\" } (and) [circuit_breaking_exception]...

We need to try and optimize the queries being done in here. There are probably two factors that are making it use a lot of memory:

  1. It’s doing all the fields in one query
  2. It’s using the full time range

If the bucket span is small and the time range long, then that’s a lot of time buckets e.g. 15 minute bucket span, 12 month time range.

So we probably need to be pragmatic about capping the number of buckets that we take the max over. Say last 1000 buckets. That should be easy to calculate knowing the bucket span and time range provided. Just adjust earliestMs to be max(earliestMs, latestMs - 1000 * interval) (obviously needs to be converted to ms).

As well as being defensive in the queries, you should probably also expose to the user that an error has occurred calling the estimate model memory limit endpoint e.g. in a toast notification. Currently you only see the error in the Kibana server console, or by looking in the browser network tab.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for testing! For now, I added capping of the time range as well as replace must part of the query with filter to take advantage of caching, check 7ee91e7.
In a follow-up PR, I'll try to split the current ES query into multiple (in case there are plenty of influencers provided) and check how it performs.

@darnautov darnautov requested a review from peteharverson March 19, 2020 12:45
@darnautov
Copy link
Contributor Author

@elasticmachine merge upstream

Copy link
Contributor

@peteharverson peteharverson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Capping to a max of 1000 buckets has cleared the circuit_breaker_exceptions I was seeing on the gallery data before. Would be good to expose any errors from the endpoint to the user in the follow-up.

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@darnautov darnautov merged commit 7aa4651 into elastic:master Mar 19, 2020
@darnautov darnautov deleted the ML-48510-mml-enhancements branch March 19, 2020 15:45
darnautov added a commit to darnautov/kibana that referenced this pull request Mar 19, 2020
* [ML] refactor calculate_model_memory_limit route, use estimateModelMemory endpoint

* [ML] refactor validate_model_memory_limit, migrate tests to jest

* [ML] fix typing issue

* [ML] start estimateModelMemory url with /

* [ML] fix typo, filter mlcategory

* [ML] extract getCardinalities function

* [ML] fields_service.ts

* [ML] wip getMaxBucketCardinality

* [ML] refactor and comments

* [ML] fix aggs keys with special characters, fix integration tests

* [ML] use pre-defined job types

* [ML] fallback to 0 in case max bucket cardinality receives null

* [ML] calculateModelMemoryLimit on influencers change

* [ML] fix maxModelMemoryLimit

* [ML] cap aggregation to max 1000 buckets

* [ML] rename intervalDuration
darnautov added a commit that referenced this pull request Mar 20, 2020
* [ML] refactor calculate_model_memory_limit route, use estimateModelMemory endpoint

* [ML] refactor validate_model_memory_limit, migrate tests to jest

* [ML] fix typing issue

* [ML] start estimateModelMemory url with /

* [ML] fix typo, filter mlcategory

* [ML] extract getCardinalities function

* [ML] fields_service.ts

* [ML] wip getMaxBucketCardinality

* [ML] refactor and comments

* [ML] fix aggs keys with special characters, fix integration tests

* [ML] use pre-defined job types

* [ML] fallback to 0 in case max bucket cardinality receives null

* [ML] calculateModelMemoryLimit on influencers change

* [ML] fix maxModelMemoryLimit

* [ML] cap aggregation to max 1000 buckets

* [ML] rename intervalDuration

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
gmmorris added a commit to gmmorris/kibana that referenced this pull request Mar 20, 2020
* master:
  [ML] Use a new ML endpoint to estimate a model memory (elastic#60376)
  [Logs UI] Correctly update the expanded log rate table rows (elastic#60306)
  fixes drag and drop flakiness (elastic#60625)
  Removing isEmptyState from embeddable input (elastic#60511)
  [Cross Cluster Replication] NP Shim (elastic#60121)
  Clear changes when canceling an edit to an alert (elastic#60518)
  Update workflow syntax (elastic#60626)
  Updating project assigner workflows to v2.0.0 of the action and back to default tokens (elastic#60577)
  migrate saved objects management edition view to react/typescript/eui (elastic#59490)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants