Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Un-assign persistent tasks as nodes exit the cluster #37656

Merged
merged 2 commits into from
Jan 22, 2019

Conversation

davidkyle
Copy link
Member

PersistentTasksClusterService decides if a task should be reassigned by checking there is a node in the cluster with the same Id instead of comparing the ephemeral Id or the actual DiscoveryNode.

https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/persistent/PersistentTasksClusterService.java#L389

During a rolling restart PersistentTasksClusterService may not observe a node exit and re-enter the cluster. When the node returns the assignment appears valid as a node with the correct Id exists in the cluster but its departure and hence the un-assignment of the task was missed. This happens most often during the rolling upgrade tests where there are 3 nodes and minimum master nodes is set to 3. The solution is to un-assign the persistent tasks when the executor node leaves the cluster.

There is a spelling mistake in the use of deassociate... the correct options are disassociate or dissociate. I've followed the existing pattern for this change and will correct the spelling in a follow up PR.

@davidkyle davidkyle added :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. v7.0.0 v6.7.0 labels Jan 21, 2019
@davidkyle davidkyle requested a review from ywelsch January 21, 2019 14:11
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


public void testDisassociateDeadNodes_givenAssignedPersistentTask() {
DiscoveryNodes nodes = DiscoveryNodes.builder()
.add(new DiscoveryNode("node1", new TransportAddress(InetAddress.getLoopbackAddress(), 9300), Version.CURRENT))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just use buildNewFakeTransportAddress() (comes from ESTestCase)

@davidkyle davidkyle force-pushed the unassign-ptask-on-node-loss branch from f48d79d to 1616eaa Compare January 22, 2019 09:58
@davidkyle davidkyle merged commit 3fad1ee into elastic:master Jan 22, 2019
@davidkyle davidkyle deleted the unassign-ptask-on-node-loss branch January 22, 2019 12:44
davidkyle added a commit that referenced this pull request Jan 22, 2019
PersistentTasksClusterService decides if a task should be reassigned by
checking there is a node in the cluster with the same Id. If a node is
restarted PersistentTasksClusterService may not observe the change and
decide the task still has a valid assignment because the node's
ephemeral Id is not used in that decision. This change un-assigns tasks
as the nodes in the cluster change.
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Jan 22, 2019
* elastic/master: (43 commits)
  Remove remaining occurances of "include_type_name=true" in docs (elastic#37646)
  SQL: Return Intervals in SQL format for CLI (elastic#37602)
  Publish to masters first (elastic#37673)
  Un-assign persistent tasks as nodes exit the cluster (elastic#37656)
  Fail start of non-data node if node has data (elastic#37347)
  Use cancel instead of timeout for aborting publications (elastic#37670)
  Follow stats api should return a 404 when requesting stats for a non existing index (elastic#37220)
  Remove deprecated FieldNamesFieldMapper.Builder#index (elastic#37305)
  Document that date math is locale independent
  Bootstrap a Zen2 cluster once quorum is discovered (elastic#37463)
  Upgrade to lucene-8.0.0-snapshot-83f9835. (elastic#37668)
  Mute failing test
  Fix java time formatters that round up (elastic#37604)
  Removes awaits fix as the fix is in. (elastic#37676)
  Mute failing test
  Ensure that max seq # is equal to the global checkpoint when creating ReadOnlyEngines (elastic#37426)
  Mute failing discovery disruption tests
  Add note about how the body is referenced (elastic#33935)
  Define constants for REST requests endpoints in tests (elastic#37610)
  Make prepare engine step of recovery source non-blocking (elastic#37573)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed Coordination/Task Management Issues for anything around the Tasks API - both persistent and node level. v6.7.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants