dm/master, dm/worker: handle etcd retryable errors for watcher #499
Conversation
/run-all-tests tidb=release-3.0
Codecov Report

@@            Coverage Diff            @@
##             master       #499   +/- ##
=========================================
  Coverage   56.2980%   56.2980%
=========================================
  Files           184        184
  Lines         18585      18585
=========================================
  Hits          10463      10463
  Misses         7072       7072
  Partials       1050       1050
    s.handleWorkerEv(ctx, workerEvCh, workerErrCh)
}()
// starting to observe status of DM-worker instances.
s.observeWorkerEvent(ctx, etcdCli, rev1)
Do we need to handle the error returned from observeWorkerEvent? (Or add a TODO for later handling, like retiring from the leader...)
I prefer to add a TODO here. Currently s.observeWorkerEvent will not return any other error, because we don't return errors in s.handleWorkerEvent. This can be done after we classify the retryable errors and the fatal errors.
OK
/run-all-tests tidb=release-3.0
    return nil
case <-time.After(500 * time.Millisecond):
    rev, err = s.resetWorkerEv(etcdCli)
    if err != nil {
Do we need to check whether this err is retryable here?
I think all the errors are retryable here.
Including errors returned from ha.GetKeepAliveWorkers?
Yes. Maybe ha.GetKeepAliveWorkers doesn't run successfully now, but it may succeed in the future.
/run-all-tests tidb=release-3.0
/run-all-tests tidb=release-3.0
LGTM
rest LGTM
/run-all-tests tidb=release-3.0
LGTM
…ap#499) * Rebuild info and restart handle etcd event routine when receiving etcd retryable error (ErrCompacted, ErrNoLeader, ErrNoSpace). * Return resp.Header.Revision even if we don't get value from etcd server.
…evision` (pingcap#518) * Reset get error in watcher like we did in pingcap#499. * Get relative etcd info in one etcd transaction to avoid using get with rev.
What problem does this PR solve?
#501
When receiving an etcd compaction error from the watcher, the handling goroutine will lose events if we don't do anything.
What is changed and how it works?
Rebuild info and restart the etcd event handling routine when receiving an etcd retryable error (ErrCompacted, ErrNoLeader, ErrNoSpace).
Return resp.Header.Revision even if we don't get a value from the etcd server.
Check List
Tests
Side effects
Related changes
dm/dm-ansible
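The second change (returning resp.Header.Revision even when the get returns no value) can be illustrated with a small sketch. The types below are simplified stand-ins for the etcd client's get-response types, and getWorkerInfo is a hypothetical helper, not code from this PR:

```go
package main

import "fmt"

// Minimal stand-ins for the etcd client response types involved.
type responseHeader struct{ Revision int64 }
type getResponse struct {
	Header *responseHeader
	Kvs    []string // simplified: the real Kvs hold key-value pairs
}

// getWorkerInfo illustrates the fix: always propagate the header
// revision, even when the key is absent, so the caller can start
// watching from a consistent revision instead of from 0.
func getWorkerInfo(resp *getResponse) (value string, rev int64) {
	rev = resp.Header.Revision // returned even on a miss
	if len(resp.Kvs) > 0 {
		value = resp.Kvs[0]
	}
	return value, rev
}

func main() {
	empty := &getResponse{Header: &responseHeader{Revision: 7}}
	v, rev := getWorkerInfo(empty)
	fmt.Printf("value=%q rev=%d\n", v, rev) // value="" rev=7
}
```

Watching from the header revision rather than 0 means the watcher resumes exactly where the read left off, with no gap for missed events.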