Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Rate limit established model memory updates #31768

Merged

Conversation

droberts195
Copy link
Contributor

There is at most one model size stats document per bucket, but
during lookback a job can churn through many buckets very quickly.
This can lead to many cluster state updates if established model
memory needs to be updated for a given model size stats document.

This change rate limits established model memory updates to one
per job per 5 seconds. This is done by scheduling the updates 5
seconds in the future, but replacing the value to be written if
another model size stats document is received during the waiting
period. Updating the values in arrears like this means that the
last value received will be the one associated with the job in the
long term, whereas alternative approaches such as not updating the
value if a new value was close to the old value would not.

There is at most one model size stats document per bucket, but
during lookback a job can churn through many buckets very quickly.
This can lead to many cluster state updates if established model
memory needs to be updated for a given model size stats document.

This change rate limits established model memory updates to one
per job per 5 seconds.  This is done by scheduling the updates 5
seconds in the future, but replacing the value to be written if
another model size stats document is received during the waiting
period.  Updating the values in arrears like this means that the
last value received will be the one associated with the job in the
long term, whereas alternative approaches such as not updating the
value if a new value was close to the old value would not.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

Copy link
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGMT. I think we should plan refactoring AutodetectResultProcessor as it has grown with a few different responsibilities now. But we should do that as a next step rather in this PR.

@droberts195 droberts195 merged commit 308e37f into elastic:master Jul 4, 2018
@droberts195 droberts195 deleted the delay_est_model_memory_updates branch July 4, 2018 12:56
dnhatn added a commit that referenced this pull request Jul 4, 2018
* master:
  [ML] Rate limit established model memory updates (#31768)
  [Docs] Correct default window_size (#31582)
  S3 fixture should report 404 on unknown bucket (#31782)
  Detach Transport from TransportService (#31727)
  [ML] Limit ML filter items to 10K (#31731)
  [ML] Return statistics about forecasts as part of the jobsstats and usage API (#31647)
  Fixture for Minio testing (#31688)
  [DOCS] Add missing get mappings docs to HLRC (#31765)
  [DOCS] Starting Elasticsearch (#31701)
  Painless: Complete Removal of Painless Type (#31699)
  Fix not waiting for Netty ThreadDeathWatcher in IT (#31758)
  Consolidate watcher setting update registration (#31762)
  Build: re-enabled bwc (#31769)
  ingest: Introduction of a bytes processor (#31733)
  Fix coerce validation_method in GeoBoundingBoxQueryBuilder (#31747)
  Add analyze API to high-level rest client (#31577)
  [DOCS] Typos
  DOC: Add examples to the SQL docs (#31633)
  Add support for AWS session tokens (#30414)
  Watcher: Reenable start/stop yaml tests (#31754)
  Implemented XContent serialisation for GetIndexResponse (#31675)
  JDBC: Fix stackoverflow on getObject and timestamp conversion (#31735)
  resolveHasher defaults to NOOP (#31723)
  Account for XContent overhead in in-flight breaker
  Split CircuitBreaker-related tests (#31659)
  Add write*Blob option to replace existing blob (#31729)
  Painless: Add Context Docs (#31190)
  Watcher: Fix chain input toXcontent serialization (#31721)
  Docs: Match the examples in the description (#31710)
  rest-high-level: added get cluster settings (#31706)
  [Docs] Correct typos (#31720)
  Clean up double semicolon code typos (#31687)
  [DOCS] Check for Windows and *nix file paths (#31648)
  [ML] Validate ML filter_id (#31535)
  Revert long lines
  Fix TransportChangePasswordActionTests
droberts195 added a commit that referenced this pull request Jul 4, 2018
There is at most one model size stats document per bucket, but
during lookback a job can churn through many buckets very quickly.
This can lead to many cluster state updates if established model
memory needs to be updated for a given model size stats document.

This change rate limits established model memory updates to one
per job per 5 seconds.  This is done by scheduling the updates 5
seconds in the future, but replacing the value to be written if
another model size stats document is received during the waiting
period.  Updating the values in arrears like this means that the
last value received will be the one associated with the job in the
long term, whereas alternative approaches such as not updating the
value if a new value was close to the old value would not.
droberts195 added a commit that referenced this pull request Jul 4, 2018
There is at most one model size stats document per bucket, but
during lookback a job can churn through many buckets very quickly.
This can lead to many cluster state updates if established model
memory needs to be updated for a given model size stats document.

This change rate limits established model memory updates to one
per job per 5 seconds.  This is done by scheduling the updates 5
seconds in the future, but replacing the value to be written if
another model size stats document is received during the waiting
period.  Updating the values in arrears like this means that the
last value received will be the one associated with the job in the
long term, whereas alternative approaches such as not updating the
value if a new value was close to the old value would not.
dnhatn added a commit that referenced this pull request Jul 5, 2018
* 6.x:
  Test: Do not remove xpack templates when cleaning (#31642)
  SQL: Allow long literals (#31777)
  SQL: Fix incorrect message for aliases (#31792)
  Detach Transport from TransportService (#31727)
  6.3.1 release notes (#31829)
  Add unreleased version 6.3.2
  [ML][TEST] Use java 11 valid time format in DataDescriptionTests (#31817)
  [ML] Don't treat stale FAILED jobs as OPENING in job allocation (#31800)
  [ML] Fix calendar and filter updates from non-master nodes (#31804)
  Fix license header generation on Windows (#31790)
  mark XPackRestIT.test {p0=monitoring/bulk/10_basic/Bulk indexing of monitoring data} as AwaitsFix
  Add JDK11 support without enabling in CI (#31644)
  Watcher: Fix check for currently executed watches (#31137)
  [DOCS] Fixes 6.3.0 release notes (#31771)
  Watcher: Ensure correct method is used to read secure settings (#31753)
  [ML] Rate limit established model memory updates (#31768)
  SQL: Update CLI logo
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants