
[ML] DFA job gets stuck when no field except the dependent variable is included in the analysis #55593

Closed
blookot opened this issue Apr 22, 2020 · 9 comments · Fixed by #55876
Assignees
Labels
>bug :ml Machine learning

Comments

@blookot

blookot commented Apr 22, 2020

Elasticsearch version (bin/elasticsearch --version): 7.6.2

JVM version (java -version): running on ESS

Description of the problem including expected versus actual behavior:

I'm running a regression data frame analytics job and it stops at 50% (loading data is 100% and analyzing is 0%).
I can't understand why...

Steps to reproduce:

  1. Load the attached CSV file (rename it with a .csv extension)
  2. Create the ML regression job on it (data frame analytics with this index as source, everything else default)
  3. Start the job

Here is the ML job configuration:

{
  "id": "test8",
  "description": "",
  "source": {
    "index": [
      "disk_usage"
    ],
    "query": {
      "match_all": {}
    },
    "_source": {
      "includes": [],
      "excludes": []
    }
  },
  "dest": {
    "index": "test8",
    "results_field": "ml"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "disk_percent",
      "prediction_field_name": "disk_percent_prediction",
      "training_percent": 80,
      "randomize_seed": -2904501521181443000
    }
  },
  "analyzed_fields": {
    "includes": [],
    "excludes": []
  },
  "model_memory_limit": "100mb",
  "create_time": 1587562995031,
  "version": "7.6.2",
  "allow_lazy_start": false
}

The logs don't show anything useful:

2020-04-22 15:43:15 | instance-0000000000 | Created analytics with analysis type [regression]
2020-04-22 15:43:17 | instance-0000000000 | Estimated memory usage for this analytics to be [18.2mb]
2020-04-22 15:43:17 | instance-0000000000 | Starting analytics on node [{instance-0000000002}{3pArSzZmQpiVzw8sQqmcQA}{FUHdB0WDRU-gNIz1SpthHQ}{10.43.1.93}{10.43.1.93:19669}{l}{logical_availability_zone=zone-0, server_name=instance-0000000002.4e4d9d9dbfd3428da12363c78f9aa352, availability_zone=europe-west1-b, ml.machine_memory=1073741824, xpack.installed=true, instance_configuration=gcp.ml.1, ml.max_open_jobs=20, region=unknown-region}]
2020-04-22 15:43:17 | instance-0000000000 | Started analytics
2020-04-22 15:43:17 | instance-0000000002 | Creating destination index [test8]
2020-04-22 15:43:18 | instance-0000000002 | Finished reindexing to destination index [test8]
2020-04-22 15:59:06 | instance-0000000002 | Finished analysis
2020-04-22 15:59:06 | instance-0000000000 | Stopped analytics

disk_usage.txt

@blookot blookot added >regression :ml Machine learning labels Apr 22, 2020
@elasticmachine
Collaborator

Pinging @elastic/ml-core (:ml)

@dimitris-athanasiou
Contributor

@blookot Could you please explain how you're indexing the data?

@blookot
Author

blookot commented Apr 22, 2020

I'm loading the CSV file using the Data Visualizer, @dimitris-athanasiou

@dimitris-athanasiou
Contributor

Thank you @blookot. I have reproduced the issue. You have uncovered a bug caused by the dataset having no features; it contains only the dependent_variable.

I think there are 2 issues to fix here:

  1. The _start API should fail when this is the case
  2. The C++ process shouldn't get stuck even if this is the case

We'll proceed to fix them both.

Once again, thank you for reporting this. It helps us make the feature better!

@blookot
Author

blookot commented Apr 22, 2020

Hi @dimitris-athanasiou,
why can't we use the timestamp as a feature?
In my case a disk is slowly filling up, and I'd like to use regression and inference to predict when it will be full.
I can plot timestamp on x and disk usage on y and get a nice dot chart...
I guess this falls into single metric ML (temporal) with forecast...
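(As an aside, the straight-line idea above can be sketched outside Elasticsearch with an ordinary least-squares fit over epoch-millis timestamps. This is purely illustrative; the numbers and field names below are invented, not taken from the attached dataset.)

```python
# Illustrative sketch only: fit disk_percent = slope * t + intercept by
# ordinary least squares over epoch-millis timestamps, then solve for the
# time at which usage would reach 100%. All data below is made up.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def time_disk_full(timestamps_ms, disk_percent, full=100.0):
    slope, intercept = fit_line(timestamps_ms, disk_percent)
    # Epoch millis at which the fitted line reaches `full` percent
    return (full - intercept) / slope

ts = [0, 1_000, 2_000, 3_000]        # epoch millis
usage = [10.0, 12.0, 14.0, 16.0]     # disk filling linearly
print(time_disk_full(ts, usage))     # → 45000.0
```

This only works because the data is assumed to grow linearly; as the replies below note, extrapolation like this is exactly what the DFA regression model is not designed for.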

@blookot
Author

blookot commented Apr 23, 2020

PS. CPU is running at 100% (on my ML node) until I stop the job!

@dimitris-athanasiou
Contributor

dimitris-athanasiou commented Apr 23, 2020

Indeed, your use case is a time series analysis. You can use an anomaly detection job to model the data and then use the forecast feature in order to predict when the disk will be full.

Having said that, we're planning to revisit date features for data frame analytics jobs. We have not addressed them yet as they require special handling that we decided to defer until later in the project. This is not a promise that we'll support them though.
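A minimal sketch of that anomaly detection + forecast approach (the job name, bucket span, and field names are assumptions based on this issue's dataset, not something verified against it):

```
PUT _ml/anomaly_detectors/disk_usage_model
{
  "analysis_config": {
    "bucket_span": "1h",
    "detectors": [
      { "function": "mean", "field_name": "disk_percent" }
    ]
  },
  "data_description": { "time_field": "timestamp" }
}

POST _ml/anomaly_detectors/disk_usage_model/_forecast
{
  "duration": "7d"
}
```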

PS. CPU is running at 100% (on my ML node) until I stop the job!

Thanks for the note! I noticed that too. We'll make sure to fix this issue.

@blookot
Author

blookot commented Apr 23, 2020

Yes, I've been playing (successfully) with single metric & forecast.
I thought dates were stored as longs (like Unix epoch), so I imagined a 2D dot plot with a regression based on my timestamp and disk usage...
But I'll wait for it :-)
Thanks again @dimitris-athanasiou


@tveasey
Contributor

tveasey commented Apr 23, 2020

We have not addressed them yet as they require special handling that we decided to defer until later in the project.

Just to add to this, the regression model we use isn't immediately well suited to extrapolation, as needed for forecasting. To get it to work in this fashion needs some explicit handling in inference and also judicious feature creation. As @dimitris-athanasiou says, using this functionality to enhance our forecasting capabilities (particularly to include additional explanatory variables) is definitely something on the roadmap.

@dimitris-athanasiou dimitris-athanasiou changed the title [ML regression] job blocked at 50% [ML] DFA job gets stuck when no field except the dependent variable is included in the analysis Apr 28, 2020
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Apr 28, 2020
We were previously checking at least one supported field existed
when the _explain API was called. However, in the case of analyses
with required fields (e.g. regression) we were not accounting that
the dependent variable is not a feature and thus if the source index
only contains the dependent variable field there are no features to
train a model on.

This commit adds a validation that at least one feature is available
for analysis. Note that we also move that validation away from
`ExtractedFieldsDetector` and the _explain API and straight into
the _start API. The reason for doing this is to allow the user to use
the _explain API in order to understand why they would be seeing an
error like this one.

For example, the user might be using an index that has fields but
they are of unsupported types. If they start the job and get
an error that there are no features, they will wonder why that is.
Calling the _explain API will show them that all their fields are
unsupported. If the _explain API was failing instead, there would
be no way for the user to understand why all those fields are
ignored.

Closes elastic#55593
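For reference, the _explain API mentioned in the commit message takes a job config and reports, per field, whether it would be included and why. A request of roughly this shape (sketched from this issue's config, not verified against a cluster) is what would let the user see that no field other than the dependent variable is usable as a feature:

```
POST _ml/data_frame/analytics/_explain
{
  "source": { "index": "disk_usage" },
  "analysis": {
    "regression": { "dependent_variable": "disk_percent" }
  }
}
```

The field_selection section of the response is what surfaces the per-field reasons, while after this fix the _start API returns the "no features" validation error.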
dimitris-athanasiou added a commit that referenced this issue Apr 29, 2020
dimitris-athanasiou added a commit that referenced this issue Apr 29, 2020
…#55876) (#55914)


Backport of #55876
4 participants