[ML] Use a new ML endpoint to estimate a model memory #60376
Conversation
Pinging @elastic/ml-ui (:ml)
The overall structure looks fine to me.
I left a few comments about how the ES endpoint is being called. (Since I don't know anything about JavaScript or TypeScript, someone else needs to review from that perspective.)
Thanks for the updates Dima. I cannot see anything wrong with this now, but please wait for a review from someone who knows the Kibana code well before merging.
indexPattern: string,
query: any,
timeFieldName: string,
earliestMs: number,
Just a note that in the follow-up bit of work, where the modules setup endpoint calls this calculateModelMemoryLimit function, we need to handle the case where earliest and latest times are not supplied. In that case, the setup endpoint should find the start/end times of the latest 3 months of data and pass those times to this function.
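The fallback described in this note could be sketched roughly as follows. This is only an illustration of the intent, not code from the PR; the function name, parameters, and the 30-day month approximation are all assumptions:

```typescript
// Illustrative sketch: resolve the time range for the model memory
// calculation, falling back to the latest ~3 months of indexed data
// when the caller did not supply explicit bounds.
const THREE_MONTHS_MS = 3 * 30 * 24 * 60 * 60 * 1000; // rough 30-day months

function resolveTimeRange(
  dataStartMs: number,
  dataEndMs: number,
  earliestMs?: number,
  latestMs?: number
): { earliestMs: number; latestMs: number } {
  // If the caller supplied both bounds, honor them as-is.
  if (earliestMs !== undefined && latestMs !== undefined) {
    return { earliestMs, latestMs };
  }
  // Otherwise use the latest three months of data, clamped so we never
  // start before the data actually begins.
  const start = Math.max(dataStartMs, dataEndMs - THREE_MONTHS_MS);
  return { earliestMs: start, latestMs: dataEndMs };
}
```

In the real setup endpoint the data start/end times would come from a query against the index; here they are plain parameters to keep the sketch self-contained.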
Thanks for the heads-up. Yes, I planned these changes for the follow-up PR dedicated to the setup endpoint.
I tested the multi metric wizard, with the time range set to one where there is no data in the index, and the call to the
This looks like it happens when you add an influencer which isn't the partitioning field.
Good catch, fixed in 7bdf515.
I'm seeing errors when trying to calculate the mml when I have a max mml set in the ES config.
Thanks for checking this. I used
LGTM
/**
 * Service for carrying out queries to obtain data
 * specific to fields in Elasticsearch indices.
 */
export function fieldsServiceProvider(callAsCurrentUser: APICaller) {
Building a multi-metric job against the gallery data set, using the full time range, and a small bucket span, I am seeing circuit_breaking_exceptions
in the network tab of the dev console as I add extra influencers.
{"statusCode":429,"error":"Too Many Requests","message":"[circuit_breaking_exception] [parent] Data too large, data for [<reused_arrays>] would be [988102840/942.3mb], which is larger than the limit of [986932838/941.2mb], real usage: [988102768/942.3mb], new bytes reserved: [72/72b], usages [request=147475312/140.6mb, fielddata=228074/222.7kb, accounting=1282468/1.2mb, inflight_requests=784/784b], with { bytes_wanted=988102840 & bytes_limit=986932838 & durability=\"TRANSIENT\" } (and) [circuit_breaking_exception]...
We need to try and optimize the queries being done in here. There are probably two factors that are making it use a lot of memory:
- It’s doing all the fields in one query
- It’s using the full time range
If the bucket span is small and the time range long, then that’s a lot of time buckets e.g. 15 minute bucket span, 12 month time range.
So we probably need to be pragmatic about capping the number of buckets that we take the max over, say the last 1000 buckets. That should be easy to calculate knowing the bucket span and time range provided: just adjust earliestMs to be max(earliestMs, latestMs - 1000 * interval) (with the interval converted to ms).
As well as being defensive in the queries, you should probably also expose to the user that an error has occurred calling the estimate model memory limit endpoint e.g. in a toast notification. Currently you only see the error in the Kibana server console, or by looking in the browser network tab.
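The suggested cap could be sketched like this. The function name and constant are illustrative, not from the PR; only the max(earliestMs, latestMs - 1000 * interval) formula comes from the comment above:

```typescript
// Cap the query's time range to at most the last 1000 buckets, so the
// aggregation never has to materialize an unbounded number of time buckets.
const MAX_BUCKETS = 1000;

function capEarliestMs(earliestMs: number, latestMs: number, bucketSpanMs: number): number {
  // Only ever shrink the range; never move the start earlier than requested.
  return Math.max(earliestMs, latestMs - MAX_BUCKETS * bucketSpanMs);
}
```

For example, with a 15-minute bucket span (900,000 ms), a 12-month range collapses to roughly the last 10.4 days (1000 × 15 minutes), which is what keeps the bucket count bounded regardless of the range the user picked.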
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for testing! For now, I added capping of the time range and replaced the must part of the query with filter to take advantage of caching; see 7ee91e7.
In a follow-up PR, I'll try to split the current ES query into multiple queries (in case there are plenty of influencers provided) and check how it performs.
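The must-to-filter change can be illustrated with a minimal sketch. The field names and values below are hypothetical, not taken from the PR; the point is that clauses in a bool query's filter context are not scored, and Elasticsearch can cache their results:

```typescript
// Illustrative request body: the same clauses that previously lived under
// `must` are moved into `filter`, which skips relevance scoring and makes
// the clause results cacheable by Elasticsearch.
const earliestMs = 0; // hypothetical time bounds for the example
const latestMs = 1_583_193_600_000;

const query = {
  bool: {
    filter: [
      { term: { airline: 'AAL' } }, // hypothetical field/value
      {
        range: {
          '@timestamp': { gte: earliestMs, lte: latestMs, format: 'epoch_millis' },
        },
      },
    ],
  },
};
```

Since the cardinality aggregations here never need relevance scores, filter context loses nothing while letting repeated wizard calls over the same range hit the query cache.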
@elasticmachine merge upstream
LGTM. Capping to a max of 1000 buckets has cleared the circuit_breaking_exceptions I was seeing on the gallery data before. Would be good to expose any errors from the endpoint to the user in the follow-up.
💚 Build Succeeded
To update your PR or re-run it, just comment with:
* [ML] refactor calculate_model_memory_limit route, use estimateModelMemory endpoint
* [ML] refactor validate_model_memory_limit, migrate tests to jest
* [ML] fix typing issue
* [ML] start estimateModelMemory url with /
* [ML] fix typo, filter mlcategory
* [ML] extract getCardinalities function
* [ML] fields_service.ts
* [ML] wip getMaxBucketCardinality
* [ML] refactor and comments
* [ML] fix aggs keys with special characters, fix integration tests
* [ML] use pre-defined job types
* [ML] fallback to 0 in case max bucket cardinality receives null
* [ML] calculateModelMemoryLimit on influencers change
* [ML] fix maxModelMemoryLimit
* [ML] cap aggregation to max 1000 buckets
* [ML] rename intervalDuration

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
* master:
  * [ML] Use a new ML endpoint to estimate a model memory (elastic#60376)
  * [Logs UI] Correctly update the expanded log rate table rows (elastic#60306)
  * fixes drag and drop flakiness (elastic#60625)
  * Removing isEmptyState from embeddable input (elastic#60511)
  * [Cross Cluster Replication] NP Shim (elastic#60121)
  * Clear changes when canceling an edit to an alert (elastic#60518)
  * Update workflow syntax (elastic#60626)
  * Updating project assigner workflows to v2.0.0 of the action and back to default tokens (elastic#60577)
  * migrate saved objects management edition view to react/typescript/eui (elastic#59490)
Summary

Part of #60386.

* Changes the calculate_model_memory_limit kibana endpoint to use the ML _estimate_model_memory API to retrieve a model memory estimation.
* Updates validate_model_memory_limit to work with the new calculate_model_memory_limit.

Checklist