
[ML] provide tmp storage for forecasting and possibly any ml native jobs #30399

Closed

Conversation

hendrikmuhs
Contributor

X-Pack part of elastic/ml-cpp#89

This implementation lazily (on the first forecast request) checks for available disk space and creates a subfolder for storing data outside of the Lucene indexes, but as part of the ES data paths.

Details:

  • tmp storage is managed and does not allow allocation if disk space is below a threshold (5GB at the moment)
  • tmp storage is supposed to be managed by the native component, but in case this fails, cleanup is provided:
    • on job close
    • on process crash
    • after node crash, on restart
  • available space is re-checked for every forecast call (the native component has to check again before writing)

Note:

  • the first path that has enough space is chosen
  • the path cannot change once it is chosen (job close/reopen triggers a new search); see the sketch below
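For illustration, here is a minimal sketch of the path-selection idea described above, using only the JDK file APIs; the class, method, folder, and constant names are placeholders rather than the exact code in this PR:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    class TmpStorageSketch {
        // 5GB threshold, matching the value mentioned above
        private static final long MIN_DISK_SPACE_BYTES = 5L * 1024 * 1024 * 1024;

        // Pick the first ES data path with enough free space, create a per-job tmp
        // subfolder under it, and return it; return null if no path qualifies.
        Path tryGetTmpStorage(Path[] dataPaths, String jobId, long requestedBytes) throws IOException {
            for (Path dataPath : dataPaths) {
                long usableBytes = Files.getFileStore(dataPath).getUsableSpace();
                if (usableBytes - requestedBytes > MIN_DISK_SPACE_BYTES) {
                    Path tmpDir = dataPath.resolve("ml-local-data").resolve("tmp").resolve(jobId);
                    Files.createDirectories(tmpDir);
                    return tmpDir;
                }
            }
            return null;
        }
    }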

@elasticmachine
Collaborator

Pinging @elastic/ml-core

@hendrikmuhs
Contributor Author

FWIW: I manually tested this change again, just to be sure.

I will rebase the change after the review.

Contributor

@edsavage edsavage left a comment

LGTM

Just a couple of typos spotted in passing

try {
removeTmpStorage(jobTask.getJobId());
} catch (IOException e) {
logger.error(new ParameterizedMessage("[{}]Failed to delete temporay files", jobTask.getJobId()), e);
Contributor

typo: temporary

try {
removeTmpStorage(jobId);
} catch (IOException e) {
logger.error(new ParameterizedMessage("[{}]Failed to delete temporay files", jobId), e);
Contributor

typo: temporary

Contributor

@droberts195 droberts195 left a comment

I left a couple of minor comments.

Also, is there a plan to test this end-to-end somewhere? I can see that, due to the size of the data required to actually use the temp storage, it's probably best not done as part of the tests in the elasticsearch repo. But is there a plan to exercise the functionality using one of the larger data sets in the ML QA repo?

/**
* Provide storage for native components.
*/
@SuppressForbidden(reason = "provides storage to external components")
Contributor

What exactly is forbidden that this class uses? Is it possible to scope this annotation to just 1 or 2 methods rather than the whole class?

Contributor Author

As far as I remember it had something to do with the use of IOUtils, but as you pointed out below, a lot has changed since the first implementation. I will revisit.

package org.elasticsearch.xpack.ml.job.process;

import org.apache.logging.log4j.Logger;
import org.apache.lucene.util.IOUtils;
Contributor

I think this should be org.elasticsearch.core.internal.io.IOUtils now. Since this work started there has been another PR that switched from Lucene's IO utils to an equivalent Elastic class.
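In other words, the fix would just be a swap of the import (assuming the call sites stay the same):

    // before: Lucene's utility class
    import org.apache.lucene.util.IOUtils;
    // after: the equivalent Elasticsearch class suggested above
    import org.elasticsearch.core.internal.io.IOUtils;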

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Looks good, a few issues to address.

@@ -156,6 +156,9 @@ synchronous and predictable. Also the trigger engine thread is only started on
data nodes. And the Execute Watch API can be triggered regardless is watcher is
started or stopped. ({pull}30118[#30118])

Forecasting of Machine Learning job time series is now supported for large jobs
by temporary storing model state on disk. ({pull}30399[#30399])
Contributor

nit: temporary -> temporarily

public void onNodeStartup() {
try {
nativeStorageProvider.cleanupLocalTmpStorageInCaseOfUncleanShutdown();
} catch (IOException e) {
Contributor

It might be worth catching Exception here, as a runtime exception creeping in in the future could prevent the node from starting, and a failed cleanup is probably not a good enough reason for that.
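A minimal sketch of that suggestion, assuming the surrounding class has the nativeStorageProvider and logger fields shown in the snippet above:

    public void onNodeStartup() {
        try {
            nativeStorageProvider.cleanupLocalTmpStorageInCaseOfUncleanShutdown();
        } catch (Exception e) {
            // catch Exception rather than only IOException so that an unexpected runtime
            // failure during cleanup cannot prevent the node from starting
            logger.warn("Failed to clean up ML temporary storage on node startup", e);
        }
    }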

/**
* Provide storage for native components.
*/
@SuppressForbidden(reason = "provides storage to external components")
Contributor

If the usage of forbidden APIs is in a few places, I would consider it better to suppress only at the lowest level (sometimes I like wrapping those calls in a private method that I suppress). The reason is that if an unintentional forbidden call creeps in, it will still be caught.
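As an illustration of that pattern, a hedged sketch where the forbidden call is confined to a single private method; the example call and class name are assumptions, not the code in this PR:

    import org.elasticsearch.common.SuppressForbidden;

    import java.io.File;
    import java.nio.file.Path;

    class NativeStorageHelper {
        @SuppressForbidden(reason = "the native process needs a plain file path")
        private static File toFile(Path path) {
            // the forbidden java.io.File usage is limited to this method, so any new
            // unintentional forbidden call elsewhere in the class is still caught
            return path.toFile();
        }
    }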

* Do not call while there are running jobs.
* @throws IOException if cleanup fails
*/
public void cleanupLocalTmpStorageInCaseOfUncleanShutdown () throws IOException {
Contributor

nit: space before brackets

* Do not call while there are running jobs.
* @throws IOException if cleanup fails
*/
public void cleanupLocalTmpStorageInCaseOfUncleanShutdown () throws IOException {
Contributor

The name of the method seems to explain where it's expected to be called from. I wonder if that is necessary.

return tmpDirectory;
}
} catch (IOException e) {
LOGGER.debug("Failed to optain information about path [{}]: {}", path, e);
Contributor

typo: optain -> obtain

* requested size
* @return a Path to local storage or null if storage is not available
*/
public Path tryGetTmpStorage(JobTask jobTask, long requestedSizeAsBytes) {
Contributor

I think we could pass the requested size as a ByteSizeValue and only convert at the lower level. This would make the API less error-prone.
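A sketch of what the suggested signature could look like, assuming a lower-level overload that works on raw bytes (the overload is illustrative, not the PR's actual code):

    import org.elasticsearch.common.unit.ByteSizeValue;

    public Path tryGetTmpStorage(JobTask jobTask, ByteSizeValue requestedSize) {
        // convert to bytes only at the lowest level, as suggested above
        return tryGetTmpStorage(jobTask.getJobId(), requestedSize.getBytes());
    }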

try {
removeTmpStorage(jobId);
} catch (IOException e) {
logger.error(new ParameterizedMessage("[{}]Failed to delete temporay files", jobId), e);
Contributor

nit: space after [{}] plus typo on temporay -> temporary

@hendrikmuhs hendrikmuhs force-pushed the forecast-517-off-heap-storage branch from f7d6c1c to 8d2e902 Compare May 9, 2018 07:21
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

LGTM from my side as well

@droberts195 droberts195 added review and removed WIP labels May 9, 2018
Contributor

@droberts195 droberts195 left a comment

LGTM

@droberts195
Contributor

I think that, in addition to the changes already made, the assertion on line 229 of ForecastIT.java needs updating to match the updated error message from the C++.

@droberts195
Contributor

I muted org.elasticsearch.xpack.ml.integration.ForecastIT.testMemoryLimit on master and 6.x as it will fail while the C++ and Java changes propagate through the various CI jobs.

To minimise build failure noise, please can you:

  1. Wait for a successful ml-cpp build
  2. Merge master into this PR branch
  3. Revert 6a8aa99 (which is where I muted the test)
  4. Update the assertion on line 229 of ForecastIT.java
  5. Make sure the PR build for this PR goes green

@hendrikmuhs hendrikmuhs force-pushed the forecast-517-off-heap-storage branch from 384c02d to 05cc40a Compare May 15, 2018 08:00
@hendrikmuhs
Contributor Author

I rewrote the test to actually test overflowing to disk.

} catch (ElasticsearchStatusException e) {
if (e.getMessage().contains("disk space")) {
throw new ElasticsearchStatusException(
"Test likely fails due to insufficient disk space on test machine, please free up space.", e.status(), e);
Contributor

Do you know how much free space a test machine has to have to avoid this situation? If it's 1GB that's probably OK, but if it's 5GB then this could annoy people if a multi-hour test run fails because of it.

Contributor Author

Yes, if the disk space is below 5GB, the folder for tmp storage will not be injected, which will cause the engine to fail.

I only see 2 options:

  • I lower the limit; 5GB is still quite large, but I would not go below 2GB.
  • I drop the test and we cover this as a QA test.

I do not think 5GB is that much of a problem these days, except when run in a virtual environment where virtual disks are small.

Do we have any information about the disk space usage of other integration tests? Indexes can get quite large as well.

Contributor

except when run in a virtual environment where virtual disks are small

This is the situation I was thinking of. Tests are quite often run in VMs to test different operating systems etc.

I think it would be good to cover this as a QA test too, but maybe a third option would be: instead of throwing an exception if there's a disk space error, use the "assume" feature when the free disk space in the build directory is less than 5GB. Something along the lines of this (but with the correct code for finding available disk space substituted):

    if (e.getMessage().contains("disk space")) {
        assumeTrue("Insufficient disk space to run test", something.getDiskSpaceInGB() > 5);
    }

If a test fails due to an assumption not holding then it does not fail the entire test suite.
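One possible way to fill in that placeholder, assuming the JDK FileStore API (java.nio.file.Files/Paths) is acceptable here; which directory to measure (the build directory, a node data path, or the system temp dir) is an open choice:

    // sketch only: measure free space on the filesystem holding the system temp dir
    long freeBytes = Files.getFileStore(Paths.get(System.getProperty("java.io.tmpdir"))).getUsableSpace();
    assumeTrue("Insufficient disk space to run test", freeBytes > 5L * 1024 * 1024 * 1024);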

Contributor Author

@hendrikmuhs hendrikmuhs May 15, 2018

I thought about this as well, but I have reservations: it is effectively like disabling the test, because nobody reads the log, so nobody will notice if this starts failing.

To throw in another option: we could disable the 5GB check for the test, as I need much less (the models use 77MB of heap, the JSON is larger, but certainly below 500MB). To disable it, there could be a force parameter in the forecast action or we make the 5GB a node parameter and tweak it in the test setup.

Contributor

we make the 5GB a node parameter and tweak it in the test setup

That sounds good.
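For illustration, a hedged sketch of what such a node-scoped setting could look like; the setting key and default shown here are assumptions for the sake of the example, not necessarily what was committed:

    import org.elasticsearch.common.settings.Setting;
    import org.elasticsearch.common.unit.ByteSizeUnit;
    import org.elasticsearch.common.unit.ByteSizeValue;

    // node-scoped minimum free disk space required before tmp storage is handed out
    public static final Setting<ByteSizeValue> MIN_DISK_SPACE_OFF_HEAP = Setting.byteSizeSetting(
            "xpack.ml.min_disk_space_off_heap", new ByteSizeValue(5, ByteSizeUnit.GB), Setting.Property.NodeScope);

The integration test could then lower this setting in its node configuration so the overflow-to-disk path can be exercised without needing 5GB of free space on the build machine.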

Contributor Author

@droberts195 Can you take a quick look at 315fbde?


List<ForecastRequestStats> forecastStats = getForecastStats();
assertThat(forecastStats.size(), equalTo(1));
{
Contributor

This extra scoping level seems unnecessary.

@hendrikmuhs hendrikmuhs force-pushed the forecast-517-off-heap-storage branch from 5b0b460 to 315fbde Compare May 17, 2018 18:03
Contributor

@droberts195 droberts195 left a comment

LGTM

hendrikmuhs pushed a commit that referenced this pull request May 18, 2018
…obs #30399

This implementation lazily (on 1st forecast request) checks for available
diskspace and creates a subfolder for storing data outside of Lucene
indexes, but as part of the ES data paths.

Details:
 - tmp storage is managed and does not allow allocation if disk space is
   below a threshold (5GB at the moment)
 - tmp storage is supposed to be managed by the native component but in
   case this fails cleanup is provided:
    - on job close
    - on process crash
    - after node crash, on restart
 - available space is re-checked for every forecast call (the native
   component has to check again before writing)

Note: The 1st path that has enough space is chosen on job open (job
close/reopen triggers a new search)
dnhatn added a commit that referenced this pull request May 19, 2018
* 6.x:
  Mute testCorruptFileThenSnapshotAndRestore
  Plugins: Remove meta plugins (#30670)
  Upgrade to Lucene-7.4.0-snapshot-59f2b7aec2 (#30726)
  Docs: Add uptasticsearch to list of clients (#30738)
  [TEST] Reduce forecast overflow to disk test memory limit (#30727)
  [DOCS] Removes redundant index.asciidoc files (#30707)
  [DOCS] Moves X-Pack configurationg pages in table of contents (#30702)
  [ML][TEST] Fix bucket count assertion in ModelPlotsIT (#30717)
  [ML][TEST] Make AutodetectMemoryLimitIT less fragile (#30716)
  [Build] Add test admin when starting gradle run with trial license and
  [ML] provide tmp storage for forecasting and possibly any ml native jobs #30399
  Tests: Fail if test watches could not be triggered (#30392)
  Watcher: Prevent duplicate watch triggering during upgrade (#30643)
  [ML] add version information in case of crash of native ML process (#30674)
  Add detailed assert message to IndexAuditUpgradeIT (#30669)
  Preserve REST client auth despite 401 response (#30558)
  Make TransportClusterStateAction abide to our style (#30697)
  [DOCS] Fixes edit URLs for stack overview (#30583)
  [DOCS] Add missing callout in IndicesClientDocumentationIT
  Backport get settings API changes to 6.x (#30494)
  Silence sleep based watcher test
  [DOCS] Replace X-Pack terms with attributes
  Improve explanation in rescore (#30629)
  [test] packaging: add windows boxes (#30402)
  [ML] Clean left behind model state docs (#30659)
  filters agg docs duplicated 'bucket' word removal (#30677)
  top_hits doc example description update (#30676)
  MovingFunction Pipeline agg backport to 6.x (#30658)
  [Docs] Replace InetSocketTransportAddress with TransportAdress (#30673)
  [TEST] Account for increase in ML C++ memory usage (#30675)
  User proper write-once semantics for GCS repository (#30438)
  Deprecate `nGram` and `edgeNGram` names for ngram filters (#30209)
  Watcher: Fix watch history template for dynamic slack attachments (#30172)
  Fix _cluster/state to always return cluster_uuid (#30656)
@droberts195
Contributor

It looks like the changes were added to master in 6c313a9, although not via the PR merge mechanism, hence the PR is still open. I will close it.

@hendrikmuhs hendrikmuhs deleted the forecast-517-off-heap-storage branch October 18, 2018 08:30