Re-fetch shard info of primary when new node joins #47035

Merged
dnhatn merged 11 commits into elastic:master from refetch-node-join on Sep 28, 2019

Conversation

dnhatn
Member

@dnhatn dnhatn commented Sep 24, 2019

Today, we don't clear the cached shard info of the primary shard when a new node joins, so we risk making replica allocation decisions based on stale information about the primary. The serious problem is that, because of the old info we have from the primary, we can cancel an ongoing recovery even though it is more advanced than the copy on the new node.

With this change, we ensure that the shard info from the primary is never older than the info from any other node when allocating replicas.
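
The idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the class `ShardInfoCache`, its methods `get` and `onNodesChanged`, and the use of plain strings for node ids and shard info are all hypothetical, and the real fetch is asynchronous rather than a synchronous function call.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical, simplified per-node shard info cache (not the Elasticsearch implementation). */
class ShardInfoCache {
    // nodeId -> cached shard info for that node
    private final Map<String, String> infoByNode = new HashMap<>();
    // nodes seen on the previous call to onNodesChanged
    private final Set<String> knownNodes = new HashSet<>();
    // stands in for fetching shard info from a node (assumption: synchronous for simplicity)
    private final Function<String, String> fetcher;

    ShardInfoCache(Function<String, String> fetcher) {
        this.fetcher = fetcher;
    }

    /** Returns the cached info for a node, fetching it only on a cache miss. */
    String get(String nodeId) {
        return infoByNode.computeIfAbsent(nodeId, fetcher);
    }

    /**
     * Reconciles the cache with the current set of nodes. If at least one node is new,
     * the primary's cached entry is dropped so that the next get() re-fetches it; the
     * primary's info is then at least as fresh as the info fetched from the new node.
     */
    void onNodesChanged(Set<String> currentNodes, String primaryNodeId) {
        boolean newNodeJoined = false;
        for (String node : currentNodes) {
            if (knownNodes.add(node)) {
                newNodeJoined = true;
            }
        }
        // forget nodes that have left, together with their cached entries
        knownNodes.retainAll(currentNodes);
        infoByNode.keySet().retainAll(currentNodes);
        if (newNodeJoined) {
            infoByNode.remove(primaryNodeId);
        }
    }
}
```

In this sketch, repeated calls to `get` for the primary return the same cached value until a call to `onNodesChanged` reports a node that was not previously known; only then is the primary's entry fetched again.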

Relates #46959

This work was done by @henningandersen in #42518.
Co-authored-by: Henning Andersen <henning.andersen@elastic.co>

@dnhatn dnhatn added the >enhancement, :Distributed Coordination/Allocation (all issues relating to the decision making around placing a shard, both master logic and on the nodes), v8.0.0, and v7.5.0 labels Sep 24, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dnhatn
Member Author

dnhatn commented Sep 24, 2019

Failure at [reference/cluster/health:36]: $body didn't match expected value:

This was fixed in #47016.

Contributor

@henningandersen henningandersen left a comment

LGTM, but I am not sure my review counts on this one 😃...

Member

@original-brownbear original-brownbear left a comment

Just some drive-by comments

Member

@original-brownbear original-brownbear left a comment

LGTM :) Thanks Nhat!
Might be best for David or Yannick to look over this as well though. I understand what's going on here just fine now, but might miss some implication of this change.

@dnhatn
Member Author

dnhatn commented Sep 27, 2019

@DaveCTurner @ywelsch This PR blocks the work in #46959. It would be great if one of you can take a look. Thank you!

Contributor

@DaveCTurner DaveCTurner left a comment

Looks good, I suggested a comment and a few small changes.

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM thanks @dnhatn

@dnhatn
Member Author

dnhatn commented Sep 27, 2019

Test failures are fixed in #47196.

@dnhatn
Member Author

dnhatn commented Sep 28, 2019

@henningandersen @original-brownbear @DaveCTurner Thank you for reviewing.

@dnhatn dnhatn merged commit caaf02f into elastic:master Sep 28, 2019
@dnhatn dnhatn deleted the refetch-node-join branch September 28, 2019 02:21
dnhatn added a commit that referenced this pull request Oct 2, 2019
Labels
:Distributed Coordination/Allocation (all issues relating to the decision making around placing a shard, both master logic and on the nodes), >enhancement, v7.5.0, v8.0.0-alpha1
6 participants