Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ML: creating ML State write alias and pointing writes there #37483

Merged
merged 10 commits into from
Jan 18, 2019

Conversation

benwtrent
Copy link
Member

This adds an alias .ml-state-write in front of .ml-state. This way when we roll .ml-state in the future, we can easily redirect the writes without downtime.

I put the alias + index creation in the TransportOpenJobAction because:

  • We need to make sure that if the node running the job is the ONLY one updated, that it can immediately write to the appropriate alias
  • Using a cluster state watcher can run into a race condition (I saw this in tests) where .ml-state-write would end up being created dynamically as a concrete index because a Job wrote state info before the cluster state watcher could create the alias.
  • There is no point in manually creating this alias and index unless there is an open job.

@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core

@droberts195
Copy link
Contributor

I put the alias + index creation in the TransportOpenJobAction

Where to create the alias is a harder problem than I realised. The problem with creating it in TransportOpenJobAction is that a cluster with opened jobs could be upgraded to 6.7 in a rolling upgrade, without any new jobs being opened. The already-opened jobs would at some point be reassigned to the upgraded 6.7 nodes which would expect to be able to use the alias but it wouldn't exist.

Another problem now we've decided to use the migration assistant reindexing for old 5.x indices is that .ml-state might be an alias pointing at a reindexed index chosen by the migration assistant, say .ml-state-reindexed-6. So in that case we'll need to instead create the .ml-state-write alias on the .ml-state-reindexed-6.

It's almost as though every single write to the state index needs to be wrapped in a method that:

  1. Checks the current cluster state to see if an .ml-state-write alias exists - if so proceed to the actual state write
  2. Checks the current cluster state to see if .ml-state is an alias - if so create an alias .ml-state-write pointing at the same index that .ml-state points to, then proceed to the actual state write
  3. Checks the current cluster state to see if .ml-state is an index - if so create an alias .ml-state-write pointing at .ml-state, then proceed to the actual state write
  4. Create an index .ml-state with alias .ml-state-write, then proceed to the actual state write

Because the 3 checks could be made against existing in-memory cluster state they should be fast enough that they don't cause significant overhead.

This is only an idea though. Let's see if anyone else has a better idea before making any code changes.

@benwtrent
Copy link
Member Author

So in that case we'll need to instead create the .ml-state-write alias on the .ml-state-reindexed-6.

The migration assistant should MOVE aliases over, if it will not, .ml-anomalies-* cannot be migrated by it.

As for the rolling upgrade problem, we COULD do the alias creation on Cluster state changes as migrating a persistent task requires that state change. However, there MAY be a race condition as the job is started and running after migrating before the state change is handled by whatever is creating the alias :(.

Is there a way to hook into the persistent task transfer to create the alias at that moment?

@davidkyle
Copy link
Member

As for the rolling upgrade problem...

The timing is difficult as if a job is left open we cannot predict when it will start up. That may happen even before the config migrator starts its work. We could move the current code in TransportOpenJobAction that creates the aliases to AutodetectProcessManager.openJob that way any job that starts will have to correct aliases.

@benwtrent
Copy link
Member Author

benwtrent commented Jan 15, 2019

AutodetectProcessManager.openJob Ah, yeah.

From what I can see here:

if (Objects.equals(tasks, previousTasks) == false || event.nodesChanged()) {
// We have some changes let's check if they are related to our node
String localNodeId = event.state().getNodes().getLocalNodeId();
Set<Long> notVisitedTasks = new HashSet<>(runningTasks.keySet());
if (tasks != null) {
for (PersistentTask<?> taskInProgress : tasks.tasks()) {
if (localNodeId.equals(taskInProgress.getExecutorNode())) {
Long allocationId = taskInProgress.getAllocationId();
AllocatedPersistentTask persistentTask = runningTasks.get(allocationId);
if (persistentTask == null) {
// New task - let's start it
startTask(taskInProgress);

And:

public Builder reassignTask(String taskId, Assignment assignment) {
PersistentTask<?> taskInProgress = tasks.get(taskId);
if (taskInProgress != null) {
changed = true;
tasks.put(taskId, new PersistentTask<>(taskInProgress, getNextAllocationId(), assignment));
} else {
throw new ResourceNotFoundException("cannot reassign task with id {" + taskId + "}, the task no longer exists");
}
return this;
}

It seems that the reassigned task is called to execute again, which would call AutodetectProcessManager.openJob downstream.

@davidkyle
Copy link
Member

Given that AutodetectProcessManager.openJob will be called by the task executor which is called by the code above so it probably makes more sense to add the alias check to OpenJobPersistentTasksExecutor

https://github.com/elastic/elasticsearch/blob/master/x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/action/TransportOpenJobAction.java#L739

@davidkyle
Copy link
Member

I think I commented at exactly the same time you pushed a commit sorry

@benwtrent
Copy link
Member Author

Jenkins retest this please

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the check to AutodetectProcessManager.openJob() will solve the problem of rolling upgrades with open jobs.

I left two other comments about places where we could write to the state index without having an open job at all, i.e. neither left open during rolling upgrade nor opened afterwards.

The migration assistant should MOVE aliases over

Yes, it will be changed to do that - see elastic/kibana#26368 (comment) - but consider this sequence of events:

  1. Customer first used ML in 5.x
  2. Customer upgrades to 6.6
  3. Customer closes all ML jobs
  4. All ML job configs are moved to indices
  5. Customer upgrades to 6.7
  6. Customer runs migration assistant and reindexes .ml-state
  7. Customer now has a .ml-state-reindexed-6 index and a .ml-state alias
  8. Customer opens an ML job

In step 8 we'll try to create the .ml-state-write alias pointing at the .ml-state index, when .ml-state is an alias at this point.

@@ -237,7 +237,7 @@ public void persistQuantiles(Quantiles quantiles) {
public void persistQuantiles(Quantiles quantiles, WriteRequest.RefreshPolicy refreshPolicy, ActionListener<IndexResponse> listener) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method can be called when reverting a model snapshot, and there's no guarantee that an autodetect process will have been started on the newer version of the product at the point when a model snapshot is reverted. The call chain is TransportRevertModelSnapshotAction.masterOperation() -> JobManager.revertSnapshot() -> this method. So one of those two calls needs to call AnomalyDetectorsIndex.createStateIndexAndAliasIfNecessary() first.

@@ -347,7 +347,7 @@ public void snapshotMlMeta(MlMetadata mlMetadata, ActionListener<Boolean> listen

logger.debug("taking a snapshot of ml_metadata");
String documentId = "ml-config";
IndexRequestBuilder indexRequest = client.prepareIndex(AnomalyDetectorsIndex.jobStateIndexName(),
IndexRequestBuilder indexRequest = client.prepareIndex(AnomalyDetectorsIndex.jobStateIndexWriteAlias(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no guarantee that an autodetect process will have been started on the newer version of the product at the point when this call is made - if all ML jobs are closed prior to upgrading from 6.5 to 6.7 then that will definitely trigger this situation. So this method needs to call AnomalyDetectorsIndex.createStateIndexAndAliasIfNecessary() first.


// Only create the index or aliases if some other ML index exists - saves clutter if ML is never used.
SortedMap<String, AliasOrIndex> mlLookup = state.getMetaData().getAliasAndIndexLookup().tailMap(".ml");
if (mlLookup.isEmpty() == false && mlLookup.firstKey().startsWith(".ml")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably shouldn't do the clutter avoidance in this case. If this method is called we know we're intending to write to an ML index shortly even if we never have up to this time. The effect of not creating the alias is pretty bad - it results in creation of a concrete index called .ml-state-write, and that is hard to switch over to an alias of the same name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely, I agree. Additionally, This method should look for the concrete indices that match the prefix .ml-state*. Should be easy enough to adjust.

);

IndexNameExpressionResolver indexNameExpressionResolver = new IndexNameExpressionResolver();
String[] state_indices = indexNameExpressionResolver.concreteIndexNames(state,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: state_indices should be stateIndices

if (state_indices.length > 0) {
List<String> indices = Arrays.asList(state_indices);
indices.sort(String::compareTo);
createAliasListener.onResponse(indices.get(indices.size() - 1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of creating the temporary list just for sorting you could sort the array directly:

Arrays.sort(stateIndices);
createAliasListener.onResponse(stateIndices[stateIndices.length - 1]);

or:

Arrays.sort(stateIndices, Collections.reverseOrder());
createAliasListener.onResponse(stateIndices[0]);

or:

createAliasListener.onResponse(Arrays.stream(stateIndices).max(String::compareTo).get());

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lulz, can't believe I missed that.

Copy link
Contributor

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@benwtrent
Copy link
Member Author

run gradle build tests 1

@benwtrent
Copy link
Member Author

run gradle build tests 2

@droberts195
Copy link
Contributor

run gradle build tests 1

@droberts195
Copy link
Contributor

run gradle build tests 2

@droberts195
Copy link
Contributor

run docbldesx

@benwtrent
Copy link
Member Author

run gradle build tests 2

@benwtrent
Copy link
Member Author

run gradle build tests 1

1 similar comment
@benwtrent
Copy link
Member Author

run gradle build tests 1

@benwtrent benwtrent merged commit 5384162 into elastic:master Jan 18, 2019
@benwtrent benwtrent deleted the feature/ml-add-state-write-alias branch January 18, 2019 20:32
benwtrent added a commit that referenced this pull request Jan 18, 2019
* ML: creating ML State write alias and pointing writes there

* Moving alias check to openJob method

* adjusting concrete index lookup for ml-state
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 21, 2019
* elastic/master: (104 commits)
  Permission for restricted indices (elastic#37577)
  Remove Watcher Account "unsecure" settings (elastic#36736)
  Add cache cleaning task for ML snapshot (elastic#37505)
  Update jdk used by the docker builds (elastic#37621)
  Remove an unused constant in PutMappingRequest.
  Update get users to allow unknown fields (elastic#37593)
  Do not add index event listener if CCR disabled (elastic#37432)
  Add local session timeouts to leader node (elastic#37438)
  Add some deprecation optimizations (elastic#37597)
  refactor inner geogrid classes to own class files (elastic#37596)
  Remove obsolete deprecation checks (elastic#37510)
  ML: Add support for single bucket aggs in Datafeeds (elastic#37544)
  ML: creating ML State write alias and pointing writes there (elastic#37483)
  Deprecate types in the put mapping API. (elastic#37280)
  [ILM] Add unfollow action (elastic#36970)
  Packaging: Update marker used to allow ELASTIC_PASSWORD (elastic#37243)
  Fix setting openldap realm ssl config
  Document the need for JAVA11_HOME (elastic#37589)
  SQL: fix object extraction from sources (elastic#37502)
  Nit in settings.gradle for Eclipse
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants