Ignore liveness update of current node if it is actually alive #1264
Conversation
When starting ambry-server, HelixClusterManager is instantiated before the server participates in the cluster. During this short window, Helix may send a notification that makes the server mark itself as down. This causes Disk_Unavailable errors when handling replication requests and checking disk state (disk state depends on node state). The error is misleading because the disk is actually healthy and the server is able to serve replication requests. Hence, this PR makes a node ignore a liveness update about itself if it is actually alive. (This should be safe, because other frontends and servers will mark the node down and no subsequent requests will be routed to it.)
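The idea can be sketched in a few lines. This is a hypothetical, simplified stand-in (the class and method names below are illustrative, not Ambry's real `HelixClusterManager` API): a listener that applies Helix's live-instance snapshot to every tracked node except itself.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the fix's logic: a liveness tracker that refuses to
// mark its own node down based on a Helix live-instance snapshot, since the
// local process knows it is alive even before it has joined the cluster.
public class SelfAwareLivenessTracker {
  private final String selfInstanceName;
  private final Map<String, Boolean> nodeUp = new HashMap<>();

  public SelfAwareLivenessTracker(String selfInstanceName) {
    this.selfInstanceName = selfInstanceName;
    // The local node starts out considered up.
    nodeUp.put(selfInstanceName, true);
  }

  public void trackNode(String instanceName, boolean up) {
    nodeUp.put(instanceName, up);
  }

  // Called with the current list of live instances reported by Helix.
  public void onLiveInstanceChange(List<String> liveInstances) {
    for (String instance : nodeUp.keySet()) {
      boolean reportedLive = liveInstances.contains(instance);
      if (!reportedLive && instance.equals(selfInstanceName)) {
        // Ignore the notification for ourselves: during startup the cluster
        // manager is created before the server participates in the cluster,
        // so Helix's snapshot may not include this node yet.
        continue;
      }
      nodeUp.put(instance, reportedLive);
    }
  }

  public boolean isUp(String instanceName) {
    return Boolean.TRUE.equals(nodeUp.get(instanceName));
  }
}
```

With this check in place, a snapshot that omits the local node leaves it marked up, while other nodes are still updated normally.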
Codecov Report
@@ Coverage Diff @@
## master #1264 +/- ##
===========================================
+ Coverage 72.35% 72.4% +0.05%
- Complexity 6160 6163 +3
===========================================
Files 444 444
Lines 35364 35364
Branches 4491 4491
===========================================
+ Hits 25586 25606 +20
+ Misses 8602 8587 -15
+ Partials 1176 1171 -5
Continue to review full report at Codecov.
LGTM. I would shorten the explanation about the disk errors.
// Disk_Unavailable errors when handling replication requests and checking the state of disk (disk state depends
// on node state). The Disk_Unavailable is misleading, because disk is actually good and the server is able to
// serve replication request. Hence, if instance name equals self instance name, cluster manager of this node
// ignores its own liveness notification from Helix to avoid incorrect Disk_Unavailable error.
I don't think we need this much explanation of what we saw without the fix. It's enough to say that the list of live instances doesn't include this node since it hasn't joined yet.