
Ignore liveness update of current node if it is actually alive #1264

Merged · 3 commits · Sep 24, 2019

Conversation

@jsjtzyy (Contributor) commented Sep 23, 2019

When ambry-server starts, HelixClusterManager is instantiated before the node
participates in the cluster. During this short window, Helix may send a
notification that makes the server mark itself as down. This causes
Disk_Unavailable errors when handling replication requests and checking
disk state (disk state depends on node state). Such errors are misleading
because the disk is actually healthy and the server is able to serve
replication requests. Hence, this PR makes the node ignore liveness updates
about itself when it is actually alive. (This should be safe: if the node
really is down, other frontends and servers will mark it down and no
subsequent requests will be routed to it.)
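The idea behind the fix can be sketched as follows. This is a minimal illustration only, not Ambry's actual HelixClusterManager code; the class and method names (LivenessTracker, onLiveInstanceChange) and the instance names are hypothetical:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the fix: a liveness tracker that refuses to mark
// its own node down based on an external notification.
public class LivenessTracker {
  private final String selfInstanceName;
  private final Set<String> upInstances = new HashSet<>();

  public LivenessTracker(String selfInstanceName) {
    this.selfInstanceName = selfInstanceName;
    // The local node considers itself alive from the start.
    upInstances.add(selfInstanceName);
  }

  /** Apply a (simulated) Helix live-instance change notification. */
  public void onLiveInstanceChange(List<String> liveInstances) {
    Set<String> next = new HashSet<>(liveInstances);
    // Core of the fix: ignore the liveness update for this node itself.
    // Even if the notification arrives before the node has joined the
    // cluster as a participant, it stays marked as up locally.
    next.add(selfInstanceName);
    upInstances.clear();
    upInstances.addAll(next);
  }

  public boolean isUp(String instanceName) {
    return upInstances.contains(instanceName);
  }

  public static void main(String[] args) {
    LivenessTracker tracker = new LivenessTracker("host1_15088");
    // Notification arrives before this node has joined as a participant,
    // so its own name is missing from the live-instance list.
    tracker.onLiveInstanceChange(Arrays.asList("host2_15088"));
    System.out.println(tracker.isUp("host1_15088")); // prints "true"
  }
}
```

Without the `next.add(selfInstanceName)` line, the early notification would flip the local node to down, which is exactly the state that cascaded into the misleading Disk_Unavailable errors.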

@jsjtzyy self-assigned this on Sep 23, 2019
@codecov-io commented Sep 23, 2019

Codecov Report

Merging #1264 into master will increase coverage by 0.05%.
The diff coverage is 100%.


@@             Coverage Diff             @@
##             master   #1264      +/-   ##
===========================================
+ Coverage     72.35%   72.4%   +0.05%     
- Complexity     6160    6163       +3     
===========================================
  Files           444     444              
  Lines         35364   35364              
  Branches       4491    4491              
===========================================
+ Hits          25586   25606      +20     
+ Misses         8602    8587      -15     
+ Partials       1176    1171       -5
Impacted Files Coverage Δ Complexity Δ
...m.github.ambry.clustermap/HelixClusterManager.java 86.5% <100%> (+0.25%) 43 <1> (+1) ⬆️
.../main/java/com.github.ambry.store/ScanResults.java 81.25% <0%> (-1.57%) 16% <0%> (-1%)
...va/com.github.ambry.replication/ReplicaThread.java 86.38% <0%> (-0.95%) 90% <0%> (-1%)
...java/com.github.ambry.router/GetBlobOperation.java 90.73% <0%> (-0.72%) 39% <0%> (ø)
...java/com.github.ambry.store/CompactionManager.java 91.66% <0%> (-0.6%) 25% <0%> (ø)
...ain/java/com.github.ambry.router/PutOperation.java 91.55% <0%> (-0.17%) 113% <0%> (-1%)
...in/java/com.github.ambry.network/SocketServer.java 84.29% <0%> (+0.41%) 15% <0%> (ø) ⬇️
...rc/main/java/com.github.ambry.store/BlobStore.java 90.46% <0%> (+0.46%) 98% <0%> (+2%) ⬆️
...in/java/com.github.ambry.store/BlobStoreStats.java 70.83% <0%> (+1.45%) 103% <0%> (+1%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 76fc882...2e4ded1. Read the comment docs.

@lightningrob (Contributor) left a comment:

LGTM. I would shorten the explanation about the disk errors.

// Disk_Unavailable errors when handling replication requests and checking the state of disk (disk state depends
// on node state). The Disk_Unavailable is misleading, because disk is actually good and the server is able to
// serve replication request. Hence, if instance name equals self instance name, cluster manager of this node
// ignores its own liveness notification from Helix to avoid incorrect Disk_Unavailable error.
A Contributor commented:

I don't think we need this much explanation of what we saw without the fix. It's enough to say that the list of live instances doesn't include this node since it hasn't joined yet.

@justinlin-linkedin justinlin-linkedin merged commit 76fab02 into linkedin:master Sep 24, 2019
Labels: none yet · Projects: none yet · 4 participants