
[ML] DFA job gets stuck when no field except the dependent variable is included in the analysis #55593

Closed
blookot opened this issue Apr 22, 2020 · 9 comments · Fixed by #55876
Assignees
Labels
>bug :ml Machine learning

Comments

@blookot

blookot commented Apr 22, 2020

Elasticsearch version (bin/elasticsearch --version): 7.6.2

JVM version (java -version): running on ESS

Description of the problem including expected versus actual behavior:

I'm running a regression data frame analytics job and it stops at 50% (loading data is 100% and analyzing is 0%).
I can't understand why...

Steps to reproduce:

  1. Load the attached CSV file (rename it with a .csv extension)
  2. Create the ML regression job on it (data frame analytics with this index as source, everything else default)
  3. Start the job

Here is the ML job configuration:

{
  "id": "test8",
  "description": "",
  "source": {
    "index": [
      "disk_usage"
    ],
    "query": {
      "match_all": {}
    },
    "_source": {
      "includes": [],
      "excludes": []
    }
  },
  "dest": {
    "index": "test8",
    "results_field": "ml"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "disk_percent",
      "prediction_field_name": "disk_percent_prediction",
      "training_percent": 80,
      "randomize_seed": -2904501521181443000
    }
  },
  "analyzed_fields": {
    "includes": [],
    "excludes": []
  },
  "model_memory_limit": "100mb",
  "create_time": 1587562995031,
  "version": "7.6.2",
  "allow_lazy_start": false
}

The logs don't show anything useful:

2020-04-22 15:43:15 | instance-0000000000 | Created analytics with analysis type [regression]
2020-04-22 15:43:17 | instance-0000000000 | Estimated memory usage for this analytics to be [18.2mb]
2020-04-22 15:43:17 | instance-0000000000 | Starting analytics on node [{instance-0000000002}{3pArSzZmQpiVzw8sQqmcQA}{FUHdB0WDRU-gNIz1SpthHQ}{10.43.1.93}{10.43.1.93:19669}{l}{logical_availability_zone=zone-0, server_name=instance-0000000002.4e4d9d9dbfd3428da12363c78f9aa352, availability_zone=europe-west1-b, ml.machine_memory=1073741824, xpack.installed=true, instance_configuration=gcp.ml.1, ml.max_open_jobs=20, region=unknown-region}]
2020-04-22 15:43:17 | instance-0000000000 | Started analytics
2020-04-22 15:43:17 | instance-0000000002 | Creating destination index [test8]
2020-04-22 15:43:18 | instance-0000000002 | Finished reindexing to destination index [test8]
2020-04-22 15:59:06 | instance-0000000002 | Finished analysis
2020-04-22 15:59:06 | instance-0000000000 | Stopped analytics

disk_usage.txt

@blookot blookot added >regression :ml Machine learning labels Apr 22, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@dimitris-athanasiou
Contributor

@blookot Could you please explain how you're indexing the data?

@blookot
Author

blookot commented Apr 22, 2020

I'm loading the CSV file using the Data Visualizer, @dimitris-athanasiou

@dimitris-athanasiou
Contributor

Thank you @blookot. I have reproduced the issue. You have uncovered a bug caused by the dataset having no features; it contains only the dependent_variable.

I think there are 2 issues to fix here:

  1. The _start API should fail when this is the case
  2. The C++ process shouldn't get stuck even if this is the case

We'll proceed to fix them both.

Once again, thank you for reporting this. It helps us make the feature better!

@blookot
Author

blookot commented Apr 22, 2020

Hi @dimitris-athanasiou,
why can't we use the timestamp as a feature?
In my case a disk is slowly filling up, and I'd like to use regression and inference to predict when it will be full.
I can plot timestamp on x and disk usage on y and get a nice dot chart...
I guess this falls into single metric ML (temporal) with forecast...
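(As an aside, the straight-line idea above can be sketched outside Elasticsearch with an ordinary least-squares fit over epoch-millis timestamps. This is purely illustrative; the numbers and field names below are invented, not taken from the attached dataset.)

```python
# Illustrative sketch only: fit disk_percent = slope * t + intercept by
# ordinary least squares over epoch-millis timestamps, then solve for the
# time at which usage would reach 100%. All data below is made up.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def time_disk_full(timestamps_ms, disk_percent, full=100.0):
    slope, intercept = fit_line(timestamps_ms, disk_percent)
    # Epoch millis at which the fitted line reaches `full` percent
    return (full - intercept) / slope

ts = [0, 1_000, 2_000, 3_000]        # epoch millis
usage = [10.0, 12.0, 14.0, 16.0]     # disk filling linearly
print(time_disk_full(ts, usage))     # → 45000.0
```

This only works because the data is assumed to grow linearly; as the replies below note, extrapolation like this is exactly what the DFA regression model is not designed for.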

@blookot
Author

blookot commented Apr 23, 2020

PS. CPU is running at 100% (on my ML node) until I stop the job!

@dimitris-athanasiou
Contributor

dimitris-athanasiou commented Apr 23, 2020

Indeed, your use case is a time series analysis. You can use an anomaly detection job to model the data and then use the forecast feature in order to predict when the disk will be full.

Having said that, we're planning to revisit date features for data frame analytics jobs. We have not addressed them yet as they require special handling that we decided to defer until later in the project. This is not a promise that we'll support them though.
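A minimal sketch of that anomaly detection + forecast approach (the job name, bucket span, and field names are assumptions based on this issue's dataset, not something verified against it):

```
PUT _ml/anomaly_detectors/disk_usage_model
{
  "analysis_config": {
    "bucket_span": "1h",
    "detectors": [
      { "function": "mean", "field_name": "disk_percent" }
    ]
  },
  "data_description": { "time_field": "timestamp" }
}

POST _ml/anomaly_detectors/disk_usage_model/_forecast
{
  "duration": "7d"
}
```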

PS. CPU is running at 100% (on my ML node) until I stop the job!

Thanks for the note! I noticed that too. We'll make sure to fix this issue.

@blookot
Author

blookot commented Apr 23, 2020

Yes, I've been playing (successfully) with single metric & forecast.
I thought dates were stored as longs (like Unix epoch), so I imagined a 2D dot plot with a regression based on my timestamp and disk usage...
But I'll wait for it :-)
Thanks again @dimitris-athanasiou


@tveasey
Contributor

tveasey commented Apr 23, 2020

We have not addressed them yet as they require special handling that we decided to defer until later in the project.

Just to add to this, the regression model we use isn't immediately well suited to extrapolation, as needed for forecasting. To get it to work in this fashion needs some explicit handling in inference and also judicious feature creation. As @dimitris-athanasiou says, using this functionality to enhance our forecasting capabilities (particularly to include additional explanatory variables) is definitely something on the roadmap.

@dimitris-athanasiou dimitris-athanasiou changed the title [ML regression] job blocked at 50% [ML] DFA job gets stuck when no field except the dependent variable is included in the analysis Apr 28, 2020
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Apr 28, 2020
We were previously checking at least one supported field existed
when the _explain API was called. However, in the case of analyses
with required fields (e.g. regression) we were not accounting that
the dependent variable is not a feature and thus if the source index
only contains the dependent variable field there are no features to
train a model on.

This commit adds a validation that at least one feature is available
for analysis. Note that we also move that validation away from
`ExtractedFieldsDetector` and the _explain API and straight into
the _start API. The reason for doing this is to allow the user to use
the _explain API in order to understand why they would be seeing an
error like this one.

For example, the user might be using an index that has fields but
they are of unsupported types. If they start the job and get
an error that there are no features, they will wonder why that is.
Calling the _explain API will show them that all their fields are
unsupported. If the _explain API was failing instead, there would
be no way for the user to understand why all those fields are
ignored.

Closes elastic#55593
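For reference, the _explain API mentioned in the commit message takes a job config and reports, per field, whether it would be included and why. A request of roughly this shape (sketched from this issue's config, not verified against a cluster) is what would let the user see that no field other than the dependent variable is usable as a feature:

```
POST _ml/data_frame/analytics/_explain
{
  "source": { "index": "disk_usage" },
  "analysis": {
    "regression": { "dependent_variable": "disk_percent" }
  }
}
```

The field_selection section of the response is what surfaces the per-field reasons, while after this fix the _start API returns the "no features" validation error.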
dimitris-athanasiou added a commit that referenced this issue Apr 29, 2020
dimitris-athanasiou added a commit that referenced this issue Apr 29, 2020
…#55876) (#55914)


Backport of #55876
4 participants