Skip to content
This repository has been archived by the owner on Nov 24, 2023. It is now read-only.

dm/master, dm/worker: handle etcd retryable errors for watcher #499

Merged
merged 18 commits into from
Mar 3, 2020

Conversation

lichunzhu
Copy link
Contributor

@lichunzhu lichunzhu commented Feb 28, 2020

What problem does this PR solve?

#501
When receiving etcd compaction error from wather, the handle goroutine will lose event if we don't do anything.

What is changed and how it works?

  1. Rebuild info and restart handle etcd event routine when receiving etcd retryable error (ErrCompacted, ErrNoLeader, ErrNoSpace).
  2. Return resp.Header.Revision even if we don't get value from etcd server.

Check List

Tests

  • Unit test
  • Integration test

Side effects

  • Increased code complexity

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the dm/dm-ansible

@lichunzhu lichunzhu added the priority/normal Minor change, requires approval from ≥1 primary reviewer label Feb 28, 2020
@lichunzhu
Copy link
Contributor Author

/run-all-tests tidb=release-3.0

@lichunzhu lichunzhu added the status/WIP This PR is still work in progress label Feb 28, 2020
@codecov
Copy link

codecov bot commented Feb 28, 2020

Codecov Report

Merging #499 into master will not change coverage by %.
The diff coverage is n/a.

@@             Coverage Diff             @@
##             master       #499   +/-   ##
===========================================
  Coverage   56.2980%   56.2980%           
===========================================
  Files           184        184           
  Lines         18585      18585           
===========================================
  Hits          10463      10463           
  Misses         7072       7072           
  Partials       1050       1050           

@lichunzhu lichunzhu added status/PTAL This PR is ready for review. Add this label back after committing new changes and removed status/WIP This PR is still work in progress labels Feb 28, 2020
s.handleWorkerEv(ctx, workerEvCh, workerErrCh)
}()
// starting to observe status of DM-worker instances.
s.observeWorkerEvent(ctx, etcdCli, rev1)
Copy link
Member

@csuzhangxc csuzhangxc Mar 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to handle the returned error for observeWorkerEvent? (or add a TODO for later handling, like retire from the leader...)

Copy link
Contributor Author

@lichunzhu lichunzhu Mar 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to add a TODO here. Now s.observeWorkerEvent will not return other error because we don't return error in s.handleWorkerEvent. This can be done after we classify the retryable errors and fatal errors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK

@lichunzhu
Copy link
Contributor Author

/run-all-tests tidb=release-3.0

1 similar comment
@lichunzhu
Copy link
Contributor Author

/run-all-tests tidb=release-3.0

return nil
case <-time.After(500 * time.Millisecond):
rev, err = s.resetWorkerEv(etcdCli)
if err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check this err is retryable here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all the errors are retryable here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

including errors returned from ha.GetKeepAliveWorkers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Maybe we don't run ha.GetKeepAliveWorkers successfully now, but we may succeed in the future.

dm/master/scheduler/scheduler.go Show resolved Hide resolved
dm/master/scheduler/scheduler_test.go Show resolved Hide resolved
dm/master/scheduler/scheduler.go Outdated Show resolved Hide resolved
dm/master/scheduler/scheduler.go Outdated Show resolved Hide resolved
dm/worker/worker.go Outdated Show resolved Hide resolved
dm/worker/worker.go Outdated Show resolved Hide resolved
@lichunzhu
Copy link
Contributor Author

/run-all-tests tidb=release-3.0

dm/worker/server.go Outdated Show resolved Hide resolved
@lichunzhu lichunzhu changed the title dm/master, dm/worker: handle etcd compaction dm/master, dm/worker: handle etcd retryable errors for watcher Mar 2, 2020
@lichunzhu
Copy link
Contributor Author

/run-all-tests tidb=release-3.0

Copy link
Contributor

@WangXiangUSTC WangXiangUSTC left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@WangXiangUSTC WangXiangUSTC added status/LGT1 One reviewer already commented LGTM and removed status/PTAL This PR is ready for review. Add this label back after committing new changes labels Mar 3, 2020
Copy link
Member

@csuzhangxc csuzhangxc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rest LGTM

dm/worker/worker.go Outdated Show resolved Hide resolved
dm/worker/worker.go Outdated Show resolved Hide resolved
dm/worker/worker.go Outdated Show resolved Hide resolved
dm/worker/worker_test.go Outdated Show resolved Hide resolved
dm/worker/worker_test.go Outdated Show resolved Hide resolved
dm/worker/worker_test.go Show resolved Hide resolved
@lichunzhu
Copy link
Contributor Author

/run-all-tests tidb=release-3.0

1 similar comment
@lichunzhu
Copy link
Contributor Author

/run-all-tests tidb=release-3.0

Copy link
Member

@csuzhangxc csuzhangxc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@csuzhangxc csuzhangxc added status/LGT2 Two reviewers already commented LGTM, ready for merge and removed status/LGT1 One reviewer already commented LGTM labels Mar 3, 2020
@lichunzhu lichunzhu merged commit c2da436 into pingcap:master Mar 3, 2020
@lichunzhu lichunzhu deleted the ha/handle-etcd-compact branch March 3, 2020 12:26
lichunzhu added a commit that referenced this pull request Mar 9, 2020
…evision` (#518)

* Reset get error in watcher like we did in #499.
* Get relative etcd info in one etcd transaction to avoid using get with rev.
lichunzhu added a commit to lichunzhu/dm that referenced this pull request Apr 6, 2020
…ap#499)

* Rebuild info and restart handle etcd event routine when receiving etcd retryable error (ErrCompacted, ErrNoLeader, ErrNoSpace).
* Return resp.Header.Revision even if we don't get value from etcd server.
lichunzhu added a commit to lichunzhu/dm that referenced this pull request Apr 6, 2020
…evision` (pingcap#518)

* Reset get error in watcher like we did in pingcap#499.
* Get relative etcd info in one etcd transaction to avoid using get with rev.
lichunzhu added a commit to lichunzhu/dm that referenced this pull request Apr 6, 2020
…ap#499)

* Rebuild info and restart handle etcd event routine when receiving etcd retryable error (ErrCompacted, ErrNoLeader, ErrNoSpace).
* Return resp.Header.Revision even if we don't get value from etcd server.
lichunzhu added a commit to lichunzhu/dm that referenced this pull request Apr 6, 2020
…evision` (pingcap#518)

* Reset get error in watcher like we did in pingcap#499.
* Get relative etcd info in one etcd transaction to avoid using get with rev.
lichunzhu added a commit to lichunzhu/dm that referenced this pull request Apr 7, 2020
…ap#499)

* Rebuild info and restart handle etcd event routine when receiving etcd retryable error (ErrCompacted, ErrNoLeader, ErrNoSpace).
* Return resp.Header.Revision even if we don't get value from etcd server.
lichunzhu added a commit to lichunzhu/dm that referenced this pull request Apr 7, 2020
…evision` (pingcap#518)

* Reset get error in watcher like we did in pingcap#499.
* Get relative etcd info in one etcd transaction to avoid using get with rev.
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
priority/normal Minor change, requires approval from ≥1 primary reviewer status/LGT2 Two reviewers already commented LGTM, ready for merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants