
Ignore liveness update of current node if it is actually alive #1264

Merged · 3 commits · Sep 24, 2019

Conversation

@jsjtzyy (Contributor) commented Sep 23, 2019

When ambry-server starts, HelixClusterManager is instantiated before the node
participates in the cluster. During this short window, Helix may send a
notification that makes the server mark itself as down. This causes
Disk_Unavailable errors when handling replication requests and checking
disk state (disk state depends on node state). Such errors are misleading
because the disk is actually healthy and the server is able to serve
replication requests. Hence, this PR makes the node ignore liveness updates
about itself when it is actually alive. (This should be safe: if the node
really is down, other frontends and servers will mark it down and no
subsequent requests will be routed to it.)
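The idea behind the fix can be sketched as follows. This is a minimal illustration only, not Ambry's actual HelixClusterManager code; the class and method names (LivenessTracker, onLiveInstanceChange) and the instance names are hypothetical:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of the fix: a liveness tracker that refuses to mark
// its own node down based on an external notification.
public class LivenessTracker {
  private final String selfInstanceName;
  private final Set<String> upInstances = new HashSet<>();

  public LivenessTracker(String selfInstanceName) {
    this.selfInstanceName = selfInstanceName;
    // The local node considers itself alive from the start.
    upInstances.add(selfInstanceName);
  }

  /** Apply a (simulated) Helix live-instance change notification. */
  public void onLiveInstanceChange(List<String> liveInstances) {
    Set<String> next = new HashSet<>(liveInstances);
    // Core of the fix: ignore the liveness update for this node itself.
    // Even if the notification arrives before the node has joined the
    // cluster as a participant, it stays marked as up locally.
    next.add(selfInstanceName);
    upInstances.clear();
    upInstances.addAll(next);
  }

  public boolean isUp(String instanceName) {
    return upInstances.contains(instanceName);
  }

  public static void main(String[] args) {
    LivenessTracker tracker = new LivenessTracker("host1_15088");
    // Notification arrives before this node has joined as a participant,
    // so its own name is missing from the live-instance list.
    tracker.onLiveInstanceChange(Arrays.asList("host2_15088"));
    System.out.println(tracker.isUp("host1_15088")); // prints "true"
  }
}
```

Without the `next.add(selfInstanceName)` line, the early notification would flip the local node to down, which is exactly the state that cascaded into the misleading Disk_Unavailable errors.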

@jsjtzyy self-assigned this on Sep 23, 2019
@codecov-io commented Sep 23, 2019

Codecov Report

Merging #1264 into master will increase coverage by 0.05%.
The diff coverage is 100%.


@@             Coverage Diff             @@
##             master   #1264      +/-   ##
===========================================
+ Coverage     72.35%   72.4%   +0.05%     
- Complexity     6160    6163       +3     
===========================================
  Files           444     444              
  Lines         35364   35364              
  Branches       4491    4491              
===========================================
+ Hits          25586   25606      +20     
+ Misses         8602    8587      -15     
+ Partials       1176    1171       -5
Impacted Files Coverage Δ Complexity Δ
...m.github.ambry.clustermap/HelixClusterManager.java 86.5% <100%> (+0.25%) 43 <1> (+1) ⬆️
.../main/java/com.github.ambry.store/ScanResults.java 81.25% <0%> (-1.57%) 16% <0%> (-1%)
...va/com.github.ambry.replication/ReplicaThread.java 86.38% <0%> (-0.95%) 90% <0%> (-1%)
...java/com.github.ambry.router/GetBlobOperation.java 90.73% <0%> (-0.72%) 39% <0%> (ø)
...java/com.github.ambry.store/CompactionManager.java 91.66% <0%> (-0.6%) 25% <0%> (ø)
...ain/java/com.github.ambry.router/PutOperation.java 91.55% <0%> (-0.17%) 113% <0%> (-1%)
...in/java/com.github.ambry.network/SocketServer.java 84.29% <0%> (+0.41%) 15% <0%> (ø) ⬇️
...rc/main/java/com.github.ambry.store/BlobStore.java 90.46% <0%> (+0.46%) 98% <0%> (+2%) ⬆️
...in/java/com.github.ambry.store/BlobStoreStats.java 70.83% <0%> (+1.45%) 103% <0%> (+1%) ⬆️
... and 4 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 76fc882...2e4ded1. Read the comment docs.

@lightningrob (Contributor) left a comment:

LGTM. I would shorten the explanation about the disk errors.

// Disk_Unavailable errors when handling replication requests and checking the state of disk (disk state depends
// on node state). The Disk_Unavailable is misleading, because disk is actually good and the server is able to
// serve replication request. Hence, if instance name equals self instance name, cluster manager of this node
// ignores its own liveness notification from Helix to avoid incorrect Disk_Unavailable error.
A Contributor commented:

I don't think we need this much explanation of what we saw without the fix. It's enough to say that the list of live instances doesn't include this node since it hasn't joined yet.

@justinlin-linkedin justinlin-linkedin merged commit 76fab02 into linkedin:master Sep 24, 2019
Labels: none yet · Projects: none yet · 4 participants