fix the issue that health check may set liveness wrongly #1127
Signed-off-by: zyguan <zhongyangguan@gmail.com>
@cfzjywxk PTAL
/cc @crazycs520 PTAL
newStore := &Store{storeID: s.storeID, addr: addr, peerAddr: store.GetPeerAddress(), saddr: store.GetStatusAddress(), storeType: storeType, labels: store.GetLabels(), state: uint64(resolved)}
newStore.livenessState = atomic.LoadUint32(&s.livenessState)
newStore.unreachableSince = s.unreachableSince
How about abstracting a newStore function that forces callers to specify livenessState and unreachableSince as input parameters?
How about handling it later, as part of #1104? There are four &Store{...} literals in region_cache.go now, and I'd like to handle them together according to the store lifecycle. Let this PR fix just the corresponding issue for now.
@@ -2783,50 +2785,85 @@ func (s *Store) requestLivenessAndStartHealthCheckLoopIfNeeded(bo *retry.Backoff
// It may be already started by another thread.
if atomic.CompareAndSwapUint32(&s.livenessState, uint32(reachable), uint32(liveness)) {
s.unreachableSince = time.Now()
go s.checkUntilHealth(c)
reResolveInterval := 30 * time.Second
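The compare-and-swap in the hunk above is what keeps concurrent callers from starting more than one health check loop. A minimal, self-contained sketch of that pattern is shown here; the store type, the loopsStarted counter, and both function names are hypothetical and exist only for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// Assumed liveness values for this sketch.
const (
	reachable uint32 = iota
	unreachable
)

// store is a hypothetical stand-in holding only what this sketch needs.
type store struct {
	livenessState uint32 // accessed atomically
	loopsStarted  int32  // counts how many health-check loops were launched
}

// markUnreachable flips reachable -> unreachable with a compare-and-swap.
// Only the caller that wins the swap starts the (simulated) health check,
// and only that caller gets true back.
func (s *store) markUnreachable() bool {
	if atomic.CompareAndSwapUint32(&s.livenessState, reachable, unreachable) {
		atomic.AddInt32(&s.loopsStarted, 1) // real code would launch a goroutine here
		return true
	}
	return false
}

// raceToMarkUnreachable has n goroutines report the failure concurrently
// and returns how many loops ended up being started.
func raceToMarkUnreachable(n int) int32 {
	s := &store{}
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.markUnreachable()
		}()
	}
	wg.Wait()
	return atomic.LoadInt32(&s.loopsStarted)
}

func main() {
	fmt.Println("loops started:", raceToMarkUnreachable(16)) // prints "loops started: 1"
}
```

Because the swap only succeeds while the state is still reachable, every later caller sees the state already changed and skips the start.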
Better to make reResolveInterval a constant value.
The rawkv tests use a failpoint to set it to a shorter value.
How about
var DefReResolveInterval = 30 * time.Second // global scope
func (s *Store) checkUntilHealth(c *RegionCache, liveness livenessState) {
	reResolveInterval := DefReResolveInterval
	...
}
and if a test needs a shorter reResolveInterval, it can change DefReResolveInterval directly; there is no need for the following failpoint.
Personally, I'd rather not do that. A global variable has to be handled carefully to avoid data races in unit tests; I struggled with SetRegionCacheTTLSec in #1122 for about an hour.
logutil.BgLogger().Info("[health check] store meta deleted, stop checking", zap.Uint64("storeID", s.storeID), zap.String("addr", s.addr))
if s.getResolveState() == deleted {
// if the store is deleted, a new store with same id must be inserted (guaranteed by reResolve).
newStore, _ := c.getStore(s.storeID)
Is it necessary to add an assertion that newStore != nil here?
It's guaranteed by reResolve; changeToActiveStore also relies on it.
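If the defensive check the reviewer asks about were wanted anyway, it might look roughly like the sketch below. This is not what the PR does, since the author relies on the reResolve guarantee. The cache and store types, the two-value getStore signature, and replacementFor are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// store and cache are hypothetical, trimmed-down stand-ins.
type store struct {
	storeID uint64
	addr    string
}

type cache struct {
	mu     sync.RWMutex
	stores map[uint64]*store
}

// getStore looks a store up by ID under the read lock.
func (c *cache) getStore(id uint64) (*store, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	s, ok := c.stores[id]
	return s, ok
}

// replacementFor returns the store currently registered under old's ID and
// fails loudly instead of returning nil if the expected replacement is missing.
func (c *cache) replacementFor(old *store) (*store, error) {
	s, ok := c.getStore(old.storeID)
	if !ok || s == nil {
		return nil, fmt.Errorf("store %d: no replacement found after deletion", old.storeID)
	}
	return s, nil
}

func main() {
	old := &store{storeID: 1, addr: "10.0.0.1:20160"}
	c := &cache{stores: map[uint64]*store{1: {storeID: 1, addr: "10.0.0.2:20160"}}}

	repl, err := c.replacementFor(old)
	fmt.Println(repl.addr, err)

	_, err = c.replacementFor(&store{storeID: 2})
	fmt.Println(err)
}
```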
* fix the issue that health check may set liveness wrongly
* fix lint issue
* fix rawkv ut
* fix data race
* use getStore instead of accessing storeMu directly
* make TestAccessFollowerAfter1TiKVDown stable
* make TestBackoffErrorType stable
* address comments
---------
Signed-off-by: zyguan <zhongyangguan@gmail.com>
Co-authored-by: disksing <i@disksing.com>
close #1111