Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CI] Failure in ml.integration.RegressionIT.testStopAndRestart assertion #49095

Closed
przemekwitek opened this issue Nov 14, 2019 · 6 comments
Closed
Labels
:ml Machine learning >test-failure Triaged test failures from CI

Comments

@przemekwitek
Copy link
Contributor

After I unmuted the test, there were 2 CI failures:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=debian-8&&immutable/389/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=amazon/389/console

With small changes in the code I was able to reproduce the issue locally today.
Here is the test log:

Suite: Test class org.elasticsearch.xpack.ml.integration.RegressionIT
  2> REPRODUCE WITH: ./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner' --tests "org.elasticsearch.xpack.ml.integration.RegressionIT.testStopAndRestart" -Dtests.seed=2454927CB657C116 -Dtests.security.manager=true -Dtests.locale=th -Dtests.timezone=America/Indiana/Vevay -Dcompiler.java=12 -Druntime.java=12
  2> java.lang.AssertionError: Hits were: {"hits":{"total":{"value":2,"relation":"eq"},"max_score":0.0,"hits":[{"_index":".ml-inference-000001","_id":"regression_stop_and_restart-1573742320992","_score":0.0,"_source":{"model_id":"regression_stop_and_restart-1573742320992","created_by":"data-frame-analytics","version":"8.0.0","create_time":1573742320992,"tags":["regression_stop_and_restart"],"metadata":{"analytics_config":{"id":"regression_stop_and_restart","source":{"index":["regression_stop_and_restart_source_index"],"query":{"match_all":{}}},"dest":{"index":"regression_stop_and_restart_source_index_results","results_field":"ml"},"analysis":{"regression":{"dependent_variable":"variable","prediction_field_name":"variable_prediction","training_percent":100.0}},"model_memory_limit":"1gb","create_time":1573742309034,"version":"8.0.0","allow_lazy_start":false}},"doc_type":"trained_model_config","input":{"field_names":["feature","variable"]}}},{"_index":".ml-inference-000001","_id":"regression_stop_and_restart-1573742322238","_score":0.0,"_source":{"model_id":"regression_stop_and_restart-1573742322238","created_by":"data-frame-analytics","version":"8.0.0","create_time":1573742322238,"tags":["regression_stop_and_restart"],"metadata":{"analytics_config":{"id":"regression_stop_and_restart","source":{"index":["regression_stop_and_restart_source_index"],"query":{"match_all":{}}},"dest":{"index":"regression_stop_and_restart_source_index_results","results_field":"ml"},"analysis":{"regression":{"dependent_variable":"variable","prediction_field_name":"variable_prediction","training_percent":100.0}},"model_memory_limit":"1gb","create_time":1573742309034,"version":"8.0.0","allow_lazy_start":false}},"doc_type":"trained_model_config","input":{"field_names":["feature","variable"]}}}]}}
    Expected: an array with size <1>
         but: array size was <2>
        at __randomizedtesting.SeedInfo.seed([2454927CB657C116:531C1082A2BDC7B6]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.assertInferenceModelPersisted(MlNativeDataFrameAnalyticsIntegTestCase.java:200)
        at org.elasticsearch.xpack.ml.integration.RegressionIT.testStopAndRestart(RegressionIT.java:309)
  2> NOTE: leaving temporary files on disk at: /Users/witek/github/elastic/elasticsearch/x-pack/plugin/ml/qa/native-multi-node-tests/build/testrun/integTestRunner/temp/org.elasticsearch.xpack.ml.integration.RegressionIT_2454927CB657C116-001
  2> Nov 14, 2019 3:38:51 PM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
  2> WARNING: Will linger awaiting termination of 1 leaked thread(s).
  2> NOTE: test params are: codec=Lucene80, sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@67df4402), locale=th, timezone=America/Indiana/Vevay
  2> NOTE: Mac OS X 10.14.5 x86_64/Oracle Corporation 12 (64-bit)/cpus=8,threads=1,free=226957824,total=536870912
  2> NOTE: All tests run in this JVM: [RegressionIT]

@przemekwitek przemekwitek added >test-failure Triaged test failures from CI :ml Machine learning labels Nov 14, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

@przemekwitek
Copy link
Contributor Author

There is interesting stuff happening in the logs (see below).
It looks like despite the job being stopped, the result processor kept working.
After a start call, another processor was created and both stored inference model.
I'm going to investigate this more thoroughly now.

» [2019-11-14T15:38:32,616][INFO ][o.e.x.m.a.TransportStartDataFrameAnalyticsAction] [integTest-0] [regression_stop_and_restart] Starting data frame analytics
» [2019-11-14T15:38:34,650][INFO ][o.e.x.m.d.DataFrameAnalyticsManager] [integTest-0] [regression_stop_and_restart] Creating destination index [regression_stop_and_restart_source_index_results]
» [2019-11-14T15:38:37,445][INFO ][o.e.x.m.a.TransportStopDataFrameAnalyticsAction] [integTest-0] [regression_stop_and_restart] Stopping task with force [false]
» [2019-11-14T15:38:38,100][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Waiting for result processor to complete
» [2019-11-14T15:38:39,027][INFO ][o.e.x.m.a.TransportStartDataFrameAnalyticsAction] [integTest-0] [regression_stop_and_restart] Starting data frame analytics
» [2019-11-14T15:38:40,498][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Waiting for result processor to complete
» [2019-11-14T15:38:42,482][INFO ][o.e.x.m.d.p.AnalyticsResultProcessor] [integTest-0] [regression_stop_and_restart] Stored trained model with id [regression_stop_and_restart-1573742320992]
» [2019-11-14T15:38:42,671][INFO ][o.e.x.m.d.p.AnalyticsResultProcessor] [integTest-0] [regression_stop_and_restart] Stored trained model with id [regression_stop_and_restart-1573742322238]
» [2019-11-14T15:38:43,767][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Result processor has completed
» [2019-11-14T15:38:43,767][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Result processor has completed
» [2019-11-14T15:38:43,769][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Closing process
» [2019-11-14T15:38:43,769][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Closing process
» [2019-11-14T15:38:43,770][INFO ][o.e.x.m.p.AbstractNativeProcess] [integTest-0] [regression_stop_and_restart] State output finished
» [2019-11-14T15:38:43,769][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [integTest-0] [regression_stop_and_restart] [data_frame_analyzer/97981] [Main.cc@226] [{"name":"E_DFTPMEstimatedPeakMemoryUsage","description":"The upfront estimate of the peak memory training the predictive model would use","value":3384615}
»  ,{"name":"E_DFTPMPeakMemoryUsage","description":"The peak memory training the predictive model used","value":7092}
»  ,{"name":"E_DFTPMTimeToTrain","description":"The time it took to train the predictive model","value":1685}
»  ,{"name":"E_DFTPMTrainedForestNumberTrees","description":"The total number of trees in the trained forest","value":2}
»  ]
» [2019-11-14T15:38:43,769][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [integTest-0] [regression_stop_and_restart] [data_frame_analyzer/97979] [Main.cc@226] [{"name":"E_DFTPMEstimatedPeakMemoryUsage","description":"The upfront estimate of the peak memory training the predictive model would use","value":3384615}
»  ,{"name":"E_DFTPMPeakMemoryUsage","description":"The peak memory training the predictive model used","value":21881}
»  ,{"name":"E_DFTPMTimeToTrain","description":"The time it took to train the predictive model","value":2770}
»  ,{"name":"E_DFTPMTrainedForestNumberTrees","description":"The total number of trees in the trained forest","value":12}
»  ]
» [2019-11-14T15:38:43,769][INFO ][o.e.x.m.p.AbstractNativeProcess] [integTest-0] [regression_stop_and_restart] State output finished
» [2019-11-14T15:38:43,771][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Closed process
» [2019-11-14T15:38:43,771][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Closed process
» [2019-11-14T15:38:43,771][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Marking task completed
» [2019-11-14T15:38:43,771][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Marking task completed

@przemekwitek
Copy link
Contributor Author

I've just raised a PR (#49167) to mute the problematic test as I've found it reproducible locally so no need to pollute CI with failures.

@przemekwitek
Copy link
Contributor Author

I've found out that there can be a situation in which AnalyticsProcessManager.processContextByAllocation map contains two process context for the same job (both with 2 distinct keys - allocation ids). In this situation two result processors can run at the same time which causes them both to persist results.

I'll try to figure out how to fix this. Probably _stop should be fixed to cancel result processor or at least wait for it. Possibly, we should also throw/warn if we find out on _start that the process context for the job already exists (but then how do we know if it is the same job or the new job with the same name?).
Anyway, this issue needs more thorough investigation.

@dimitris-athanasiou
Copy link
Contributor

That's interesting. I'll have a look as well and we can discuss together Monday.

@przemekwitek
Copy link
Contributor Author

@droberts195 droberts195 changed the title [CI] Failure in ml.integration.RegressionIT.testStopAndRestore assertion [CI] Failure in ml.integration.RegressionIT.testStopAndRestart assertion Nov 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:ml Machine learning >test-failure Triaged test failures from CI
Projects
None yet
Development

No branches or pull requests

3 participants