
[ML] provide tmp storage for forecasting and possibly any ml native jobs #30399

Closed

Conversation

hendrikmuhs
Contributor

X-Pack part of elastic/ml-cpp#89

This implementation lazily (on the first forecast request) checks for available disk space and creates a subfolder for storing data outside of the Lucene indexes, but as part of the ES data paths.

Details:

  • tmp storage is managed and does not allow allocation if disk space is below a threshold (5GB at the moment)
  • tmp storage is supposed to be managed by the native component, but in case this fails, cleanup is provided:
    • on job close
    • on process crash
    • after node crash, on restart
  • available space is re-checked for every forecast call (the native component has to check again before writing)

Note:

  • the first path that has enough space is chosen
  • the path cannot change once it is chosen (job close/reopen triggers a new search); see the sketch below
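For illustration, here is a minimal sketch of the path-selection idea described above, using only the JDK file APIs; the class, method, folder, and constant names are placeholders rather than the exact code in this PR:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    class TmpStorageSketch {
        // 5GB threshold, matching the value mentioned above
        private static final long MIN_DISK_SPACE_BYTES = 5L * 1024 * 1024 * 1024;

        // Pick the first ES data path with enough free space, create a per-job tmp
        // subfolder under it, and return it; return null if no path qualifies.
        Path tryGetTmpStorage(Path[] dataPaths, String jobId, long requestedBytes) throws IOException {
            for (Path dataPath : dataPaths) {
                long usableBytes = Files.getFileStore(dataPath).getUsableSpace();
                if (usableBytes - requestedBytes > MIN_DISK_SPACE_BYTES) {
                    Path tmpDir = dataPath.resolve("ml-local-data").resolve("tmp").resolve(jobId);
                    Files.createDirectories(tmpDir);
                    return tmpDir;
                }
            }
            return null;
        }
    }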

@elasticmachine
Collaborator

Pinging @elastic/ml-core

@hendrikmuhs
Contributor Author

FWIW: I manually tested this change again, just to be sure.

I will rebase the change after the review.

Contributor

@edsavage edsavage left a comment

LGTM

Just a couple of typos spotted in passing

try {
removeTmpStorage(jobTask.getJobId());
} catch (IOException e) {
logger.error(new ParameterizedMessage("[{}]Failed to delete temporay files", jobTask.getJobId()), e);
Contributor

typo: temporary

try {
removeTmpStorage(jobId);
} catch (IOException e) {
logger.error(new ParameterizedMessage("[{}]Failed to delete temporay files", jobId), e);
Contributor

typo: temporary

Contributor

@droberts195 droberts195 left a comment

I left a couple of minor comments.

Also, is there a plan to test this end-to-end somewhere? I can see that, due to the size of the data required to actually use the temp storage, it's probably best not done as part of the tests in the elasticsearch repo. But is there a plan to exercise the functionality using one of the larger data sets in the ML QA repo?

/**
* Provide storage for native components.
*/
@SuppressForbidden(reason = "provides storage to external components")
Contributor

What exactly is forbidden that this class uses? Is it possible to scope this annotation to just 1 or 2 methods rather than the whole class?

Contributor Author

As far as I remember it had something to do with the use of IOUtils, but as you pointed out below, a lot has changed since the first implementation. I will revisit.

package org.elasticsearch.xpack.ml.job.process;

import org.apache.logging.log4j.Logger;
import org.apache.lucene.util.IOUtils;
Contributor

I think this should be org.elasticsearch.core.internal.io.IOUtils now. Since this work started there has been another PR that switched from Lucene's IO utils to an equivalent Elastic class.
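In other words, the fix would just be a swap of the import (assuming the call sites stay the same):

    // before: Lucene's utility class
    import org.apache.lucene.util.IOUtils;
    // after: the equivalent Elasticsearch class suggested above
    import org.elasticsearch.core.internal.io.IOUtils;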

Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Looks good, a few issues to address.

@@ -156,6 +156,9 @@ synchronous and predictable. Also the trigger engine thread is only started on
data nodes. And the Execute Watch API can be triggered regardless is watcher is
started or stopped. ({pull}30118[#30118])

Forecasting of Machine Learning job time series is now supported for large jobs
by temporary storing model state on disk. ({pull}30399[#30399])
Contributor

nit: temporary -> temporarily

public void onNodeStartup() {
try {
nativeStorageProvider.cleanupLocalTmpStorageInCaseOfUncleanShutdown();
} catch (IOException e) {
Contributor

It might be worth catching Exception here, as a runtime exception creeping in in the future could prevent the node from starting, and a failed cleanup is probably not a good enough reason for that.
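A minimal sketch of that suggestion, assuming the surrounding class has the nativeStorageProvider and logger fields shown in the snippet above:

    public void onNodeStartup() {
        try {
            nativeStorageProvider.cleanupLocalTmpStorageInCaseOfUncleanShutdown();
        } catch (Exception e) {
            // catch Exception rather than only IOException so that an unexpected runtime
            // failure during cleanup cannot prevent the node from starting
            logger.warn("Failed to clean up ML temporary storage on node startup", e);
        }
    }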

/**
* Provide storage for native components.
*/
@SuppressForbidden(reason = "provides storage to external components")
Contributor

If the usage of forbidden APIs is in a few places, I would consider it better to suppress only at the lowest level (sometimes I like wrapping those calls in a private method that I suppress). The reason is that if an unintentional forbidden call creeps in, it will still be caught.
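As an illustration of that pattern, a hedged sketch where the forbidden call is confined to a single private method; the example call and class name are assumptions, not the code in this PR:

    import org.elasticsearch.common.SuppressForbidden;

    import java.io.File;
    import java.nio.file.Path;

    class NativeStorageHelper {
        @SuppressForbidden(reason = "the native process needs a plain file path")
        private static File toFile(Path path) {
            // the forbidden java.io.File usage is limited to this method, so any new
            // unintentional forbidden call elsewhere in the class is still caught
            return path.toFile();
        }
    }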

* Do not call while there are running jobs.
* @throws IOException if cleanup fails
*/
public void cleanupLocalTmpStorageInCaseOfUncleanShutdown () throws IOException {
Contributor

nit: space before brackets

* Do not call while there are running jobs.
* @throws IOException if cleanup fails
*/
public void cleanupLocalTmpStorageInCaseOfUncleanShutdown () throws IOException {
Contributor

The name of the method seems to explain where it's expected to be called from. I wonder if that is necessary.

return tmpDirectory;
}
} catch (IOException e) {
LOGGER.debug("Failed to optain information about path [{}]: {}", path, e);
Contributor

typo: optain -> obtain

* requested size
* @return a Path to local storage or null if storage is not available
*/
public Path tryGetTmpStorage(JobTask jobTask, long requestedSizeAsBytes) {
Contributor

I think we could pass the requested size as a ByteSizeValue and only convert at the lower level. This would make the API less error-prone.
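A sketch of what the suggested signature could look like, assuming a lower-level overload that works on raw bytes (the overload is illustrative, not the PR's actual code):

    import org.elasticsearch.common.unit.ByteSizeValue;

    public Path tryGetTmpStorage(JobTask jobTask, ByteSizeValue requestedSize) {
        // convert to bytes only at the lowest level, as suggested above
        return tryGetTmpStorage(jobTask.getJobId(), requestedSize.getBytes());
    }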

try {
removeTmpStorage(jobId);
} catch (IOException e) {
logger.error(new ParameterizedMessage("[{}]Failed to delete temporay files", jobId), e);
Contributor

nit: space after [{}] plus typo on temporay -> temporary

@hendrikmuhs hendrikmuhs force-pushed the forecast-517-off-heap-storage branch from f7d6c1c to 8d2e902 Compare May 9, 2018 07:21
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

LGTM from my side as well

@droberts195 droberts195 added review and removed WIP labels May 9, 2018
Contributor

@droberts195 droberts195 left a comment

LGTM

@droberts195
Contributor

I think that, in addition to the changes already made, the assertion on line 229 of ForecastIT.java needs updating to match the updated error message from the C++.

@droberts195
Contributor

I muted org.elasticsearch.xpack.ml.integration.ForecastIT.testMemoryLimit on master and 6.x as it will fail while the C++ and Java changes propagate through the various CI jobs.

To minimise build failure noise, please can you:

  1. Wait for a successful ml-cpp build
  2. Merge master into this PR branch
  3. Revert 6a8aa99 (which is where I muted the test)
  4. Update the assertion on line 229 of ForecastIT.java
  5. Make sure the PR build for this PR goes green

@hendrikmuhs hendrikmuhs force-pushed the forecast-517-off-heap-storage branch from 384c02d to 05cc40a Compare May 15, 2018 08:00
@hendrikmuhs
Contributor Author

I rewrote the test to actually test overflowing to disk.

} catch (ElasticsearchStatusException e) {
if (e.getMessage().contains("disk space")) {
throw new ElasticsearchStatusException(
"Test likely fails due to insufficient disk space on test machine, please free up space.", e.status(), e);
Contributor

Do you know how much free space a test machine has to have to avoid this situation? If it's 1GB that's probably OK, but if it's 5GB then this could annoy people if a multi-hour test run fails because of it.

Contributor Author

Yes, if the disk space is below 5GB, the folder for tmp storage will not be injected, which will cause the engine to fail.

I only see 2 options:

  • I lower the limit; 5GB is still quite large, but I would not go below 2GB.
  • I drop the test and we cover this as a QA test.

I do not think 5GB is that much of a problem these days, except when run in a virtual environment where virtual disks are small.

Do we have any information about the disk space usage of other integration tests? Indexes can get quite large as well.

Contributor

except when run in a virtual environment where virtual disks are small

This is the situation I was thinking of. Tests are quite often run in VMs to test different operating systems etc.

I think it would be good to cover this as a QA test too, but maybe a third option would be: instead of throwing an exception if there's a disk space error, use the "assume" feature when the free disk space in the build directory is less than 5GB. Something along the lines of this (but with the correct code for finding available disk space substituted):

    if (e.getMessage().contains("disk space")) {
        assumeTrue("Insufficient disk space to run test", something.getDiskSpaceInGB() > 5);
    }

If a test fails due to an assumption not holding then it does not fail the entire test suite.
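One possible way to fill in that placeholder, assuming the JDK FileStore API (java.nio.file.Files/Paths) is acceptable here; which directory to measure (the build directory, a node data path, or the system temp dir) is an open choice:

    // sketch only: measure free space on the filesystem holding the system temp dir
    long freeBytes = Files.getFileStore(Paths.get(System.getProperty("java.io.tmpdir"))).getUsableSpace();
    assumeTrue("Insufficient disk space to run test", freeBytes > 5L * 1024 * 1024 * 1024);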

Contributor Author

@hendrikmuhs hendrikmuhs May 15, 2018

I thought about this as well, but I have reservations: it is effectively like disabling the test, because nobody reads the log, so nobody will notice if this starts failing.

To throw in another option: we could disable the 5GB check for the test, as I need much less (the models use 77MB of heap, the JSON is larger, but certainly below 500MB). To disable it, there could be a force parameter in the forecast action or we make the 5GB a node parameter and tweak it in the test setup.

Contributor

we make the 5GB a node parameter and tweak it in the test setup

That sounds good.
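For illustration, a hedged sketch of what such a node-scoped setting could look like; the setting key and default shown here are assumptions for the sake of the example, not necessarily what was committed:

    import org.elasticsearch.common.settings.Setting;
    import org.elasticsearch.common.unit.ByteSizeUnit;
    import org.elasticsearch.common.unit.ByteSizeValue;

    // node-scoped minimum free disk space required before tmp storage is handed out
    public static final Setting<ByteSizeValue> MIN_DISK_SPACE_OFF_HEAP = Setting.byteSizeSetting(
            "xpack.ml.min_disk_space_off_heap", new ByteSizeValue(5, ByteSizeUnit.GB), Setting.Property.NodeScope);

The integration test could then lower this setting in its node configuration so the overflow-to-disk path can be exercised without needing 5GB of free space on the build machine.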

Contributor Author

@droberts195 Can you take a quick look at 315fbde?


List<ForecastRequestStats> forecastStats = getForecastStats();
assertThat(forecastStats.size(), equalTo(1));
{
Contributor

This extra scoping level seems unnecessary.

@hendrikmuhs hendrikmuhs force-pushed the forecast-517-off-heap-storage branch from 5b0b460 to 315fbde Compare May 17, 2018 18:03
Contributor

@droberts195 droberts195 left a comment

LGTM

hendrikmuhs pushed a commit that referenced this pull request May 18, 2018
…obs #30399

This implementation lazily (on 1st forecast request) checks for available
diskspace and creates a subfolder for storing data outside of Lucene
indexes, but as part of the ES data paths.

Details:
 - tmp storage is managed and does not allow allocation if disk space is
   below a threshold (5GB at the moment)
 - tmp storage is supposed to be managed by the native component but in
   case this fails cleanup is provided:
    - on job close
    - on process crash
    - after node crash, on restart
 - available space is re-checked for every forecast call (the native
   component has to check again before writing)

Note: The 1st path that has enough space is chosen on job open (job
close/reopen triggers a new search)
dnhatn added a commit that referenced this pull request May 19, 2018
* 6.x:
  Mute testCorruptFileThenSnapshotAndRestore
  Plugins: Remove meta plugins (#30670)
  Upgrade to Lucene-7.4.0-snapshot-59f2b7aec2 (#30726)
  Docs: Add uptasticsearch to list of clients (#30738)
  [TEST] Reduce forecast overflow to disk test memory limit (#30727)
  [DOCS] Removes redundant index.asciidoc files (#30707)
  [DOCS] Moves X-Pack configurationg pages in table of contents (#30702)
  [ML][TEST] Fix bucket count assertion in ModelPlotsIT (#30717)
  [ML][TEST] Make AutodetectMemoryLimitIT less fragile (#30716)
  [Build] Add test admin when starting gradle run with trial license and
  [ML] provide tmp storage for forecasting and possibly any ml native jobs #30399
  Tests: Fail if test watches could not be triggered (#30392)
  Watcher: Prevent duplicate watch triggering during upgrade (#30643)
  [ML] add version information in case of crash of native ML process (#30674)
  Add detailed assert message to IndexAuditUpgradeIT (#30669)
  Preserve REST client auth despite 401 response (#30558)
  Make TransportClusterStateAction abide to our style (#30697)
  [DOCS] Fixes edit URLs for stack overview (#30583)
  [DOCS] Add missing callout in IndicesClientDocumentationIT
  Backport get settings API changes to 6.x (#30494)
  Silence sleep based watcher test
  [DOCS] Replace X-Pack terms with attributes
  Improve explanation in rescore (#30629)
  [test] packaging: add windows boxes (#30402)
  [ML] Clean left behind model state docs (#30659)
  filters agg docs duplicated 'bucket' word removal (#30677)
  top_hits doc example description update (#30676)
  MovingFunction Pipeline agg backport to 6.x (#30658)
  [Docs] Replace InetSocketTransportAddress with TransportAdress (#30673)
  [TEST] Account for increase in ML C++ memory usage (#30675)
  User proper write-once semantics for GCS repository (#30438)
  Deprecate `nGram` and `edgeNGram` names for ngram filters (#30209)
  Watcher: Fix watch history template for dynamic slack attachments (#30172)
  Fix _cluster/state to always return cluster_uuid (#30656)
@droberts195
Contributor

It looks like the changes were added to master in 6c313a9, although not via the PR merge mechanism, hence the PR is still open. I will close it.

@hendrikmuhs hendrikmuhs deleted the forecast-517-off-heap-storage branch October 18, 2018 08:30