Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Uncouple persistent task state and status #31031

Merged
merged 6 commits into from
Jun 15, 2018

Conversation

tlrx
Copy link
Member

@tlrx tlrx commented Jun 1, 2018

This pull request removes the relationship between the state of persistent task (as stored in the cluster state) and the status of the task (as reported by the Task APIs and used in various places) that have been confusing for some time (#29608).

In order to do that, a new PersistentTaskState interface is added. This interface represents the persisted state of a persistent task. The methods used to update the state of persistent tasks are renamed: updatePersistentStatus() becomes updatePersistentTaskState() and now takes a PersistentTaskState as a parameter. The Task.Status type as been changed to PersistentTaskState in all places were it make sense (in persistent task customs in cluster state and all other methods that deal with the state of an allocated persistent task).

Since the task status and task state are both named writeables, the change should be backward compatible and does not require extra serialization/deserialization logic. A node could deserialize the state of a persistent task as a named writable xpack/ml/job of class Task.Status while a more recent node will deserialize the same object as a named writable xpack/ml/job of class PersistentTaskState.

This applies to all existing persistent tasks: RollUpJobTask, DatafeedTask, JobTask and upcoming ShardFollowNodeTask.

Machine Learning

Luckily the persistent tasks in ML never exposed their custom states (DatafeedState, JobTaskStatus) in the Task APIs. It means that the status reported in the Task API through the TaskInfo object has always been (and is still) the default PersistentTasksNodeService.Status. No change is needed, and DatafeedState and JobTaskStatus have been changed to implement PersistentTaskState instead of Task.Status. JobTaskStatus is also renamed to JobTaskState.

RollUp

RollUpJobTask uses the same RollupJobStatus for its status and its persisted state. Since it is about to be released in 6.3.0 it's too late to change this. Therefore RollupJobStatus has been changed to implement both Task.Status and PersistentTaskState and declared as such in named writebale registry, making it possible to be serialized/deserialized under both interfaces. It means that it does not require anything for bwc for now, but we maybe want to change that in the future as the role of RollupJobStatus is still confusing.

@tlrx tlrx added review :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. v7.0.0 >refactoring v6.4.0 labels Jun 1, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@tlrx
Copy link
Member Author

tlrx commented Jun 1, 2018

Requiring a first review from Boaz, then I'd like to have Martijn and Zachary to have a look.

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. I left some minor suggestions.

private final P params;
@Nullable
private final Status status;
private final @Nullable P params;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

params are not nullable any more - please make sure this doesn't revive it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

@@ -403,8 +397,9 @@ public XContentBuilder toXContent(XContentBuilder builder, ToXContent.Params xPa
if (params != null) {
builder.field("params", params, xParams);
}
if (status != null) {
builder.field("status", status, xParams);
if (state != null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we never use xcontent for cross node communication. I think we can use state and be tolerant of reading status when parsing? (until 7.0)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it should work. Rollup/ML jobs can be snapshotted as part of the global cluster state, but if it still parses status as a deprecated field it should be OK.

@@ -215,8 +215,8 @@ public void onFailure(Exception e) {
}
}


public static class Status implements Task.Status {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this one? How bad would it be to remove it and let the task implementation add one if they need to?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's the default status implementation for allocated persistent task and it just shows the internal state (started/completed/pending cancel) of the persistent task.

I think that we don't really need it and I'd prefer to see it removed. But this change was already big, and removing this class means to take care of some backward compatibility (this status is used as part of the TaskInfo object which is serialized over the network when using the Tasks API and also potentially persisted in TaskResult).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. makes sense

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left one more bwc comment

@@ -94,8 +92,8 @@ public PersistentTasksCustomMetaData(long lastAllocationId, Map<String, Persiste
ObjectParser<TaskDescriptionBuilder<PersistentTaskParams>, String> parser = new ObjectParser<>("named");
parser.declareObject(TaskDescriptionBuilder::setParams,
(p, c) -> p.namedObject(PersistentTaskParams.class, c, null), new ParseField("params"));
parser.declareObject(TaskDescriptionBuilder::setStatus,
(p, c) -> p.namedObject(Status.class, c, null), new ParseField("status"));
parser.declareObject(TaskDescriptionBuilder::setState,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need the status too for bwc? can you see what it takes to have a BWC test which shuts down a cluster with a persistent task and then does a full restart and check it started again with the right state? This can be a follow up.

Copy link
Member Author

@tlrx tlrx Jun 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's already here with new ParseField("state", "status")? It produces a deprecation warning but allows to parses status as it was state. Since this is a named object, status will be parsed a as PersistentTaskState object and this is not an issue.

can you see what it takes to have a BWC test which shuts down a cluster with a persistent task and then does a full restart and check it started again with the right state?

Sure. In addition to this, I'd like to add a basic bwc unit test for this status/state renaming and the associated deprecation warning in 6.x.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's already here with new ParseField("state", "status")?

Heh. I don't know how I missed it. Sorry for the noise.

Sure. In addition to this, I'd like to add a basic bwc unit test for this status/state renaming and the associated deprecation warning in 6.x.

I think we can do this as a follow up?

Copy link
Contributor

@bleskes bleskes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -94,8 +92,8 @@ public PersistentTasksCustomMetaData(long lastAllocationId, Map<String, Persiste
ObjectParser<TaskDescriptionBuilder<PersistentTaskParams>, String> parser = new ObjectParser<>("named");
parser.declareObject(TaskDescriptionBuilder::setParams,
(p, c) -> p.namedObject(PersistentTaskParams.class, c, null), new ParseField("params"));
parser.declareObject(TaskDescriptionBuilder::setStatus,
(p, c) -> p.namedObject(Status.class, c, null), new ParseField("status"));
parser.declareObject(TaskDescriptionBuilder::setState,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's already here with new ParseField("state", "status")?

Heh. I don't know how I missed it. Sorry for the noise.

Sure. In addition to this, I'd like to add a basic bwc unit test for this status/state renaming and the associated deprecation warning in 6.x.

I think we can do this as a follow up?

@tlrx tlrx merged commit 992c788 into elastic:master Jun 15, 2018
@tlrx tlrx deleted the persistent-task-state branch June 15, 2018 07:59
tlrx added a commit that referenced this pull request Jun 15, 2018
This pull request removes the relationship between the state
of persistent task (as stored in the cluster state) and the status
of the task (as reported by the Task APIs and used in various
places) that have been confusing for some time (#29608).

In order to do that, a new PersistentTaskState interface is added.
This interface represents the persisted state of a persistent task.
The methods used to update the state of persistent tasks are
renamed: updatePersistentStatus() becomes updatePersistentTaskState()
and now takes a PersistentTaskState as a parameter. The
Task.Status type as been changed to PersistentTaskState in all
places were it make sense (in persistent task customs in cluster
state and all other methods that deal with the state of an allocated
persistent task).
tlrx added a commit that referenced this pull request Jun 15, 2018
* master:
  992c788 Uncouple persistent task state and status (#31031)
  8c6ee7d Describe how to add a plugin in Dockerfile (#31340)
  1c5cec0 Remove http status code maps (#31350)
  87a676e Do not set vm.max_map_count when unnecessary (#31285)
  e5b7137 TEST: getCapturedRequestsAndClear should be atomic (#31312)
  0324103 Painless: Fix bug for static method calls on interfaces (#31348)
  d6d0727 QA: Fix resolution of default distribution (#31351)
  fcf1e41 Extract common http logic to server (#31311)
  6dd81ea Build: Fix the license in the pom zip and tar (#31336)
  8f886cd Treat ack timeout more like a publish timeout (#31303)
  9b29327 [ML] Add description to ML filters (#31330)
  f7a0caf SQL: Fix build on Java 10
  375d09c [TEST] Fix RemoteClusterClientTests#testEnsureWeReconnect
  4877cec More detailed tracing when writing metadata (#31319)
  bbfe1ec [Tests] Mutualize fixtures code in BaseHttpFixture (#31210)
tlrx added a commit that referenced this pull request Jun 15, 2018
tlrx added a commit that referenced this pull request Jun 15, 2018
* 6.x:
  6d711fa (origin/6.x, 6.x) Uncouple persistent task state and status (#31031)
  f0f16b7 [TEST] Make SSL restrictions update atomic (#31050)
  652193f Describe how to add a plugin in Dockerfile (#31340)
  bb568ab TEST: getCapturedRequestsAndClear should be atomic (#31312)
  6f94914 Do not set vm.max_map_count when unnecessary (#31285)
  c12f3c7 Painless: Fix bug for static method calls on interfaces (#31348)
  21128e2 QA: Fix resolution of default distribution (#31351)
  df17a83 Build: Fix the license in the pom zip and tar (#31336)
  3e76b15 Treat ack timeout more like a publish timeout (#31303)
tlrx added a commit that referenced this pull request Jun 15, 2018
@tlrx
Copy link
Member Author

tlrx commented Jun 15, 2018

Thanks @bleskes. I'll open a follow up PR with the tests you asked for.

tlrx added a commit that referenced this pull request Jun 26, 2018
This pull request adds a full cluster restart test for a Rollup job. 
The test creates and starts a Rollup job on the cluster and checks 
that the job already exists and is correctly started on the upgraded 
cluster.

This test allows to test that the persistent task state is correctly 
parsed from the cluster state after the upgrade, as the status field 
has been renamed to state in #31031.

The test undercovers a ClassCastException that can be thrown in 
the RollupIndexer when the timestamp as a very low value that fits 
into an integer. When it's the case, the value is parsed back as an 
Integer instead of Long object and (long) position.get(rollupFieldName) 
fails.
tlrx added a commit that referenced this pull request Jun 26, 2018
This pull request adds a full cluster restart test for a Rollup job.
The test creates and starts a Rollup job on the cluster and checks
that the job already exists and is correctly started on the upgraded
cluster.

This test allows to test that the persistent task state is correctly
parsed from the cluster state after the upgrade, as the status field
has been renamed to state in #31031.

The test undercovers a ClassCastException that can be thrown in
the RollupIndexer when the timestamp as a very low value that fits
into an integer. When it's the case, the value is parsed back as an
Integer instead of Long object and (long) position.get(rollupFieldName)
fails.
tlrx added a commit that referenced this pull request Jun 26, 2018
This pull request adds a full cluster restart test for a Rollup job.
The test creates and starts a Rollup job on the cluster and checks
that the job already exists and is correctly started on the upgraded
cluster.

This test allows to test that the persistent task state is correctly
parsed from the cluster state after the upgrade, as the status field
has been renamed to state in #31031.

The test undercovers a ClassCastException that can be thrown in
the RollupIndexer when the timestamp as a very low value that fits
into an integer. When it's the case, the value is parsed back as an
Integer instead of Long object and (long) position.get(rollupFieldName)
fails.
@jimczi jimczi added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. >refactoring v6.4.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants