ML: creating ML State write alias and pointing writes there #37483

benwtrent · 2019-01-15T15:11:04Z

This adds an alias .ml-state-write in front of .ml-state. This way when we roll .ml-state in the future, we can easily redirect the writes without downtime.

I put the alias + index creation in the TransportOpenJobAction because:

We need to make sure that if the node running the job is the ONLY one updated, it can immediately write to the appropriate alias.
Using a cluster state watcher can run into a race condition (I saw this in tests) where .ml-state-write ends up being created dynamically as a concrete index because a job wrote state info before the cluster state watcher could create the alias.
There is no point in manually creating this alias and index unless there is an open job.

elasticmachine · 2019-01-15T15:11:07Z

Pinging @elastic/ml-core

droberts195 · 2019-01-15T17:46:22Z

I put the alias + index creation in the TransportOpenJobAction

Where to create the alias is a harder problem than I realised. The problem with creating it in TransportOpenJobAction is that a cluster with opened jobs could be upgraded to 6.7 in a rolling upgrade, without any new jobs being opened. The already-opened jobs would at some point be reassigned to the upgraded 6.7 nodes which would expect to be able to use the alias but it wouldn't exist.

Another problem now we've decided to use the migration assistant reindexing for old 5.x indices is that .ml-state might be an alias pointing at a reindexed index chosen by the migration assistant, say .ml-state-reindexed-6. So in that case we'll need to instead create the .ml-state-write alias on the .ml-state-reindexed-6.

It's almost as though every single write to the state index needs to be wrapped in a method that:

1. Checks the current cluster state to see if an .ml-state-write alias exists - if so, proceeds to the actual state write
2. Checks the current cluster state to see if .ml-state is an alias - if so, creates an alias .ml-state-write pointing at the same index that .ml-state points to, then proceeds to the actual state write
3. Checks the current cluster state to see if .ml-state is an index - if so, creates an alias .ml-state-write pointing at .ml-state, then proceeds to the actual state write
4. Otherwise, creates an index .ml-state with alias .ml-state-write, then proceeds to the actual state write

Because the 3 checks could be made against existing in-memory cluster state they should be fast enough that they don't cause significant overhead.
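As a rough illustration of the checks listed above, here is a minimal Java sketch. It runs against a hypothetical in-memory stand-in for the cluster state, not the real Elasticsearch ClusterState API; the names ToyClusterState, StateWriteAliasSketch and ensureWriteAlias are invented for this example, and the real code would create aliases and indices asynchronously through the client.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical stand-in for cluster state: concrete index names plus alias -> index mappings. */
final class ToyClusterState {
    final Set<String> indices = new HashSet<>();
    final Map<String, String> aliases = new HashMap<>();
}

final class StateWriteAliasSketch {
    static final String STATE_INDEX = ".ml-state";
    static final String WRITE_ALIAS = ".ml-state-write";

    /** Runs the checks in order and returns the concrete index the write alias points at. */
    static String ensureWriteAlias(ToyClusterState state) {
        // Check 1: the write alias already exists, nothing to do.
        String existingTarget = state.aliases.get(WRITE_ALIAS);
        if (existingTarget != null) {
            return existingTarget;
        }
        // Check 2: .ml-state is itself an alias (for example after a reindex),
        // so point the write alias at the same concrete index.
        String reindexedTarget = state.aliases.get(STATE_INDEX);
        if (reindexedTarget != null) {
            state.aliases.put(WRITE_ALIAS, reindexedTarget);
            return reindexedTarget;
        }
        // Check 3: .ml-state is a concrete index; otherwise (step 4) create it first.
        if (!state.indices.contains(STATE_INDEX)) {
            state.indices.add(STATE_INDEX);
        }
        state.aliases.put(WRITE_ALIAS, STATE_INDEX);
        return STATE_INDEX;
    }
}
```

In this toy model all lookups are plain map and set reads, which mirrors the point that the checks only touch in-memory state.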

This is only an idea though. Let's see if anyone else has a better idea before making any code changes.

benwtrent · 2019-01-15T18:26:20Z

So in that case we'll need to instead create the .ml-state-write alias on the .ml-state-reindexed-6.

The migration assistant should MOVE aliases over; if it does not, .ml-anomalies-* cannot be migrated by it.

As for the rolling upgrade problem, we COULD do the alias creation on cluster state changes, as migrating a persistent task requires such a state change. However, there MAY be a race condition: the job could be started and running after migration before the state change is handled by whatever is creating the alias :(.

Is there a way to hook into the persistent task transfer to create the alias at that moment?

davidkyle · 2019-01-15T18:39:19Z

As for the rolling upgrade problem...

The timing is difficult because if a job is left open we cannot predict when it will start up. That may happen even before the config migrator starts its work. We could move the current code in TransportOpenJobAction that creates the aliases to AutodetectProcessManager.openJob; that way any job that starts will have the correct aliases.

benwtrent · 2019-01-15T19:01:08Z

AutodetectProcessManager.openJob

Ah, yeah.

From what I can see here:

elasticsearch/server/src/main/java/org/elasticsearch/persistent/PersistentTasksNodeService.java

Lines 104 to 115 in 0227260

    
    if (Objects.equals(tasks, previousTasks) == false || event.nodesChanged()) {
        // We have some changes let's check if they are related to our node
        String localNodeId = event.state().getNodes().getLocalNodeId();
        Set<Long> notVisitedTasks = new HashSet<>(runningTasks.keySet());
        if (tasks != null) {
            for (PersistentTask<?> taskInProgress : tasks.tasks()) {
                if (localNodeId.equals(taskInProgress.getExecutorNode())) {
                    Long allocationId = taskInProgress.getAllocationId();
                    AllocatedPersistentTask persistentTask = runningTasks.get(allocationId);
                    if (persistentTask == null) {
                        // New task - let's start it
                        startTask(taskInProgress);

And:

elasticsearch/server/src/main/java/org/elasticsearch/persistent/PersistentTasksCustomMetaData.java

Lines 592 to 601 in f4e9729

    
    public Builder reassignTask(String taskId, Assignment assignment) {
        PersistentTask<?> taskInProgress = tasks.get(taskId);
        if (taskInProgress != null) {
            changed = true;
            tasks.put(taskId, new PersistentTask<>(taskInProgress, getNextAllocationId(), assignment));
        } else {
            throw new ResourceNotFoundException("cannot reassign task with id {" + taskId + "}, the task no longer exists");
        }
        return this;
    }

It seems that the reassigned task is called to execute again, which would call AutodetectProcessManager.openJob downstream.
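To make that reading of the two snippets concrete, here is a heavily simplified Java model of it. Everything in it (ReassignmentSketch, Task, assign, onClusterChanged) is hypothetical and only mimics the two behaviours quoted above: a reassignment hands out a fresh allocation ID, and a node starts any task assigned to it whose allocation ID it is not already running.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical toy model; these are not the real persistent task classes.
final class ReassignmentSketch {

    /** A task as recorded in the toy cluster state. */
    static final class Task {
        final String id;
        final long allocationId;
        final String executorNode;

        Task(String id, long allocationId, String executorNode) {
            this.id = id;
            this.allocationId = allocationId;
            this.executorNode = executorNode;
        }
    }

    private long lastAllocationId = 0;
    // Toy cluster state: task id -> current assignment.
    private final Map<String, Task> tasks = new HashMap<>();
    // Tasks running on the single node we observe, keyed by allocation id.
    private final Map<Long, Task> runningTasks = new HashMap<>();
    // Ids of tasks started on the observed node, in start order.
    final List<String> started = new ArrayList<>();

    /** Assigns or reassigns a task; every call hands out a fresh allocation id. */
    void assign(String taskId, String node) {
        tasks.put(taskId, new Task(taskId, ++lastAllocationId, node));
    }

    /** What the observed node does when it sees a cluster state change. */
    void onClusterChanged(String localNodeId) {
        for (Task task : tasks.values()) {
            if (localNodeId.equals(task.executorNode) && runningTasks.get(task.allocationId) == null) {
                // Unknown allocation id on this node: treat it as a new task and start it.
                runningTasks.put(task.allocationId, task);
                started.add(task.id);
            }
        }
    }
}
```

Under these assumptions, a job reassigned from an old node to an upgraded one is started again on the upgraded node exactly once, which is the point at which an alias check placed in the start path would run.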

davidkyle · 2019-01-15T21:17:16Z

Given that AutodetectProcessManager.openJob will be called by the task executor, which is called by the code above, it probably makes more sense to add the alias check to OpenJobPersistentTasksExecutor:

https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportOpenJobAction.java#L739

davidkyle · 2019-01-15T21:19:00Z

I think I commented at exactly the same time you pushed a commit, sorry.

benwtrent · 2019-01-16T13:46:59Z

Jenkins retest this please

droberts195

Moving the check to AutodetectProcessManager.openJob() will solve the problem of rolling upgrades with open jobs.

I left two other comments about places where we could write to the state index without having an open job at all, i.e. neither left open during rolling upgrade nor opened afterwards.

The migration assistant should MOVE aliases over

Yes, it will be changed to do that - see elastic/kibana#26368 (comment) - but consider this sequence of events:

1. Customer first used ML in 5.x
2. Customer upgrades to 6.6
3. Customer closes all ML jobs
4. All ML job configs are moved to indices
5. Customer upgrades to 6.7
6. Customer runs migration assistant and reindexes .ml-state
7. Customer now has a .ml-state-reindexed-6 index and a .ml-state alias
8. Customer opens an ML job

In step 8 we'll try to create the .ml-state-write alias pointing at the .ml-state index, but at this point .ml-state is an alias, not an index.

droberts195 · 2019-01-16T13:07:14Z

.../plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/persistence/JobResultsPersister.java

@@ -237,7 +237,7 @@ public void persistQuantiles(Quantiles quantiles) {
    public void persistQuantiles(Quantiles quantiles, WriteRequest.RefreshPolicy refreshPolicy, ActionListener<IndexResponse> listener) {


This method can be called when reverting a model snapshot, and there's no guarantee that an autodetect process will have been started on the newer version of the product at the point when a model snapshot is reverted. The call chain is TransportRevertModelSnapshotAction.masterOperation() -> JobManager.revertSnapshot() -> this method. So one of those two calls needs to call AnomalyDetectorsIndex.createStateIndexAndAliasIfNecessary() first.

droberts195 · 2019-01-16T13:10:05Z

x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/MlConfigMigrator.java

@@ -347,7 +347,7 @@ public void snapshotMlMeta(MlMetadata mlMetadata, ActionListener<Boolean> listen

        logger.debug("taking a snapshot of ml_metadata");
        String documentId = "ml-config";
-        IndexRequestBuilder indexRequest = client.prepareIndex(AnomalyDetectorsIndex.jobStateIndexName(),
+        IndexRequestBuilder indexRequest = client.prepareIndex(AnomalyDetectorsIndex.jobStateIndexWriteAlias(),


There's no guarantee that an autodetect process will have been started on the newer version of the product at the point when this call is made - if all ML jobs are closed prior to upgrading from 6.5 to 6.7 then that will definitely trigger this situation. So this method needs to call AnomalyDetectorsIndex.createStateIndexAndAliasIfNecessary() first.

droberts195 · 2019-01-16T13:50:02Z

...ore/src/main/java/org/elasticsearch/xpack/core/ml/job/persistence/AnomalyDetectorsIndex.java

+
+        // Only create the index or aliases if some other ML index exists - saves clutter if ML is never used.
+        SortedMap<String, AliasOrIndex> mlLookup = state.getMetaData().getAliasAndIndexLookup().tailMap(".ml");
+        if (mlLookup.isEmpty() == false && mlLookup.firstKey().startsWith(".ml")) {


We probably shouldn't do the clutter avoidance in this case. If this method is called we know we're intending to write to an ML index shortly even if we never have up to this time. The effect of not creating the alias is pretty bad - it results in creation of a concrete index called .ml-state-write, and that is hard to switch over to an alias of the same name.

benwtrent · 2019-01-16

Definitely, I agree. Additionally, this method should look for the concrete indices that match the prefix .ml-state*. Should be easy enough to adjust.

droberts195 · 2019-01-17T15:10:08Z

...ore/src/main/java/org/elasticsearch/xpack/core/ml/job/persistence/AnomalyDetectorsIndex.java

+        );
+
+        IndexNameExpressionResolver indexNameExpressionResolver = new IndexNameExpressionResolver();
+        String[] state_indices = indexNameExpressionResolver.concreteIndexNames(state,


nit: state_indices should be stateIndices

droberts195 · 2019-01-17T15:14:15Z

...ore/src/main/java/org/elasticsearch/xpack/core/ml/job/persistence/AnomalyDetectorsIndex.java

+        if (state_indices.length > 0) {
+            List<String> indices = Arrays.asList(state_indices);
+            indices.sort(String::compareTo);
+            createAliasListener.onResponse(indices.get(indices.size() - 1));


Instead of creating the temporary list just for sorting you could sort the array directly:

    Arrays.sort(stateIndices);
    createAliasListener.onResponse(stateIndices[stateIndices.length - 1]);

or:

    Arrays.sort(stateIndices, Collections.reverseOrder());
    createAliasListener.onResponse(stateIndices[0]);

or:

    createAliasListener.onResponse(Arrays.stream(stateIndices).max(String::compareTo).get());

benwtrent · 2019-01-17

Lulz, can't believe I missed that.
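For reference, the three suggested alternatives should all pick the same name. The following self-contained Java sketch checks that on a made-up list of index names; the class and method names are hypothetical, and only the Arrays, Collections and stream calls come from the JDK. It assumes a non-empty array, as guaranteed by the `state_indices.length > 0` guard in the diff.

```java
import java.util.Arrays;
import java.util.Collections;

// Hypothetical helper class, for illustration only.
final class LatestStateIndexSketch {

    /** Sorts a copy ascending and takes the last element. */
    static String viaSort(String[] stateIndices) {
        String[] copy = stateIndices.clone();
        Arrays.sort(copy);
        return copy[copy.length - 1];
    }

    /** Sorts a copy descending and takes the first element. */
    static String viaReverseSort(String[] stateIndices) {
        String[] copy = stateIndices.clone();
        Arrays.sort(copy, Collections.reverseOrder());
        return copy[0];
    }

    /** Finds the maximum without sorting at all. */
    static String viaStreamMax(String[] stateIndices) {
        return Arrays.stream(stateIndices).max(String::compareTo).get();
    }
}
```

All three rely on lexicographic String ordering, so a name ending in "-9" would sort after one ending in "-10"; whether that matters depends on how rolled-over state indices end up being named.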

droberts195

LGTM

benwtrent · 2019-01-17T20:22:25Z

run gradle build tests 1

benwtrent · 2019-01-17T20:22:45Z

run gradle build tests 2

droberts195 · 2019-01-18T13:33:31Z

run gradle build tests 1

droberts195 · 2019-01-18T13:33:41Z

run gradle build tests 2

droberts195 · 2019-01-18T13:36:46Z

run docbldesx

benwtrent · 2019-01-18T14:13:39Z

run gradle build tests 2

benwtrent · 2019-01-18T15:53:17Z

run gradle build tests 1

benwtrent · 2019-01-18T16:17:02Z

run gradle build tests 1

* ML: creating ML State write alias and pointing writes there
* Moving alias check to openJob method
* adjusting concrete index lookup for ml-state

* elastic/master: (104 commits)
  Permission for restricted indices (elastic#37577)
  Remove Watcher Account "unsecure" settings (elastic#36736)
  Add cache cleaning task for ML snapshot (elastic#37505)
  Update jdk used by the docker builds (elastic#37621)
  Remove an unused constant in PutMappingRequest.
  Update get users to allow unknown fields (elastic#37593)
  Do not add index event listener if CCR disabled (elastic#37432)
  Add local session timeouts to leader node (elastic#37438)
  Add some deprecation optimizations (elastic#37597)
  refactor inner geogrid classes to own class files (elastic#37596)
  Remove obsolete deprecation checks (elastic#37510)
  ML: Add support for single bucket aggs in Datafeeds (elastic#37544)
  ML: creating ML State write alias and pointing writes there (elastic#37483)
  Deprecate types in the put mapping API. (elastic#37280)
  [ILM] Add unfollow action (elastic#36970)
  Packaging: Update marker used to allow ELASTIC_PASSWORD (elastic#37243)
  Fix setting openldap realm ssl config
  Document the need for JAVA11_HOME (elastic#37589)
  SQL: fix object extraction from sources (elastic#37502)
  Nit in settings.gradle for Eclipse
  ...

ML: creating ML State write alias and pointing writes there

9ccd34e

benwtrent added v7.0.0 >refactoring :ml Machine learning v6.7.0 labels Jan 15, 2019

style fix

a98b2d3

Moving alias check to openJob method

e194d8e

droberts195 reviewed Jan 16, 2019

benwtrent and others added 4 commits January 16, 2019 12:10

Addressing PR comments

a8bbd2e

Merge branch 'master' into feature/ml-add-state-write-alias

5754435

updating for master merge

694bc86

Fixing tests after master merge

77749e6

droberts195 reviewed Jan 17, 2019

benwtrent added 2 commits January 17, 2019 11:19

adjusting concrete index lookup for ml-state

3974ec9

Merge branch 'master' into feature/ml-add-state-write-alias

eb50b03

droberts195 approved these changes Jan 17, 2019

Merge branch 'master' into feature/ml-add-state-write-alias

bf23c35

droberts195 mentioned this pull request Jan 18, 2019

[ML] Update index mappings on process start, not job open #37607

Closed

benwtrent merged commit 5384162 into elastic:master Jan 18, 2019

benwtrent deleted the feature/ml-add-state-write-alias branch January 18, 2019 20:32

droberts195 mentioned this pull request Jan 23, 2019

[ML] Update ML results mappings on process start #37758

Merged

benwtrent mentioned this pull request Jan 30, 2019

[ML] Job opening fails during .ml-state creation #36271

Closed

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

droberts195 mentioned this pull request Jul 6, 2020

[CI] upgraded_cluster/30_ml_jobs_crud/Test open old jobs failed with .ml-state-write not an alias #59011

Closed

droberts195 mentioned this pull request Jul 3, 2023

[CI] UpgradeClusterClientYamlTestSuiteIT test {p0=upgraded_cluster/30_ml_jobs_crud/Test open old jobs} failing #97323

Closed
