This repository has been archived by the owner on Nov 24, 2023. It is now read-only.

ha: refactor the schedule model #473

Merged
merged 125 commits into from
Feb 18, 2020
Commits
48408ae
ha: add some design for the new HA model
csuzhangxc Feb 10, 2020
aa5b69f
Update pkg/ha/doc.go
csuzhangxc Feb 10, 2020
ca3191d
ha: add some design for the new HA model
csuzhangxc Feb 10, 2020
e6643f0
ha: add etcd operation sample for source
csuzhangxc Feb 11, 2020
c9000f2
ha: add etcd operation sample for source
csuzhangxc Feb 11, 2020
80964fa
add subtask
lichunzhu Feb 11, 2020
98a225e
address comments
lichunzhu Feb 11, 2020
4308f8f
ha: add etcd operation sample for dm-worker info
csuzhangxc Feb 11, 2020
4232b88
Merge branch 'ha-refactor' of github.com:pingcap/dm into ha-refactor
csuzhangxc Feb 11, 2020
a1b76b3
ha: update copyright year
csuzhangxc Feb 11, 2020
5953a80
ha: update wording
csuzhangxc Feb 11, 2020
b69de45
return map for subtask key
lichunzhu Feb 11, 2020
1dcff3b
Merge branch 'ha-refactor' of https://github.com/pingcap/dm into ha-r…
lichunzhu Feb 11, 2020
1ec15dc
fix bug
lichunzhu Feb 11, 2020
3119076
ha: add etcd operation for source bound
csuzhangxc Feb 11, 2020
c4c81d6
Merge remote-tracking branch 'remotes/origin/ha-dev' into ha-refactor
csuzhangxc Feb 11, 2020
7c97eed
add keepalive
lichunzhu Feb 11, 2020
580b3b6
refine code
lichunzhu Feb 11, 2020
5a0a934
fix hound
lichunzhu Feb 12, 2020
caa378a
fix make check
lichunzhu Feb 12, 2020
e0c07a2
ha: add etcd operation for stage
csuzhangxc Feb 12, 2020
91b1d01
Merge branch 'ha-refactor' of github.com:pingcap/dm into ha-refactor
csuzhangxc Feb 12, 2020
6ec665d
ha: report error for watcher through chan
csuzhangxc Feb 12, 2020
af3687c
add errCh, support multi put
lichunzhu Feb 12, 2020
ef8a5d7
Merge branch 'ha-refactor' of https://github.com/pingcap/dm into ha-r…
lichunzhu Feb 12, 2020
d012095
ha: add etcd operation for operations in one txn
csuzhangxc Feb 12, 2020
0a234be
Merge branch 'ha-refactor' of github.com:pingcap/dm into ha-refactor
csuzhangxc Feb 12, 2020
55cd37c
ha: refine code
csuzhangxc Feb 12, 2020
7c2fc72
refine code
lichunzhu Feb 12, 2020
0a5893e
support getting alive workers
lichunzhu Feb 12, 2020
428a94e
Merge branch 'ha-refactor' of https://github.com/pingcap/dm into ha-r…
lichunzhu Feb 12, 2020
eae1b1c
refine code
lichunzhu Feb 12, 2020
ed12d30
address comments
lichunzhu Feb 13, 2020
6ef1a7f
refine wait time
lichunzhu Feb 13, 2020
44b8d54
add revision support
lichunzhu Feb 13, 2020
d903258
add ut for rev
lichunzhu Feb 13, 2020
d27aaf7
support revision in stage
lichunzhu Feb 13, 2020
4727c07
refine stage revision test
lichunzhu Feb 13, 2020
d372f9f
support etcd operation for subtask and relay
lichunzhu Feb 13, 2020
cc47624
add revision for bound and source
lichunzhu Feb 14, 2020
6e91443
Merge branch 'ha-refactor' of https://github.com/pingcap/dm into ha-r…
lichunzhu Feb 14, 2020
b77d34e
ha: get all bound relationship
csuzhangxc Feb 14, 2020
bb6be0c
refine revision for event
lichunzhu Feb 14, 2020
cb0858f
Merge branch 'ha-refactor' of https://github.com/pingcap/dm into ha-r…
lichunzhu Feb 14, 2020
278046f
refine comments
lichunzhu Feb 14, 2020
4436cec
support etcd operations on dm-worker
lichunzhu Feb 14, 2020
9ac916c
Merge branch 'ha-refactor' of https://github.com/pingcap/dm into ha-r…
lichunzhu Feb 14, 2020
2ecdc56
fix check
lichunzhu Feb 14, 2020
535d059
scheduler: add worker agent; add scheduler skeleton
csuzhangxc Feb 14, 2020
b874b79
scheduler: handle source config; refine code
csuzhangxc Feb 14, 2020
295d020
scheduler: record bounds and unbounds; add test steps
csuzhangxc Feb 14, 2020
277f130
ha: add delete API for source bound
csuzhangxc Feb 15, 2020
ce49d96
scheduler: bound/unbound when the worker become online/offline
csuzhangxc Feb 15, 2020
cb4f16f
scheduler: refine tests
csuzhangxc Feb 15, 2020
935eccf
scheduler: put relay stage when put the first bound; add some code fo…
csuzhangxc Feb 15, 2020
eb0925f
scheduler: support update relay stage.
csuzhangxc Feb 15, 2020
e324398
scheduler: support add subtasks
csuzhangxc Feb 15, 2020
d9bea9d
ha: add get all for source config and relay stage.
csuzhangxc Feb 15, 2020
4198b1f
ha: add get all for subtask config
csuzhangxc Feb 15, 2020
5b744a6
scheduler: support update subtask; recover config and stage for sourc…
csuzhangxc Feb 15, 2020
e6f6a3e
refine worker UT
lichunzhu Feb 16, 2020
30d33e4
Merge branch 'ha-refactor' of https://github.com/pingcap/dm into ha-r…
lichunzhu Feb 16, 2020
73cee97
ha: address comments
csuzhangxc Feb 16, 2020
b65ece5
move decrypt config to init
lichunzhu Feb 16, 2020
f8d7c75
Merge branch 'ha-refactor' of https://github.com/pingcap/dm into ha-r…
lichunzhu Feb 16, 2020
964aeb6
*: refine bound source to worker
csuzhangxc Feb 16, 2020
2f7bff7
*: support remove subtask; update remove source
csuzhangxc Feb 16, 2020
4207d9e
scheduler: fix unbounds when removing source
csuzhangxc Feb 16, 2020
7711e6f
scheduler: add SendRequest API for worker agent
csuzhangxc Feb 16, 2020
19e329b
scheduler: add more comments and test cases
csuzhangxc Feb 16, 2020
bdbe56f
scheduler: address comment to fix deadlock
csuzhangxc Feb 17, 2020
713640e
remove coordinator, switch to scheduler api
lichunzhu Feb 17, 2020
5d07e1c
Merge branch 'ha-refactor' of https://github.com/pingcap/dm into ha-r…
lichunzhu Feb 17, 2020
c2a7651
refine clear env method
lichunzhu Feb 17, 2020
095f7a4
clear etcd info
lichunzhu Feb 17, 2020
0a9d45c
refine UT
lichunzhu Feb 17, 2020
7f841c1
refine master UT
lichunzhu Feb 17, 2020
8a2c88c
extract ClearTestInfoOperation
lichunzhu Feb 17, 2020
4a6dcb4
set mysqlConfig password to empty
lichunzhu Feb 17, 2020
595a11a
fix
lichunzhu Feb 17, 2020
0794bb0
*: remove the code about Coordinator
csuzhangxc Feb 17, 2020
ab12458
operate source before keepalive, refine logs and add purgeRelayDir
lichunzhu Feb 17, 2020
b7a863a
Merge branch 'ha-refactor' of https://github.com/lichunzhu/dm into ha…
lichunzhu Feb 17, 2020
88afe72
*: rename `MySQLConfig` to `SourceConfig`, `operate-worker` to `opera…
csuzhangxc Feb 17, 2020
7c7be03
Merge branch 'ha-refactor' of github.com:pingcap/dm into ha-refactor
csuzhangxc Feb 17, 2020
1795377
worker: fix merge
csuzhangxc Feb 17, 2020
af535f3
address comment
lichunzhu Feb 17, 2020
06b259b
switch to mock
lichunzhu Feb 17, 2020
fe8d7a3
*: encode check's config to toml; log the error about start work fail
csuzhangxc Feb 17, 2020
6128026
make worker not relying on downstream tidb
lichunzhu Feb 17, 2020
06a0332
Merge branch 'ha-refactor' of https://github.com/pingcap/dm into ha-r…
lichunzhu Feb 17, 2020
53890fd
*: rename `operate-worker` to `operate-source`, `dm-mysql.toml` to `s…
csuzhangxc Feb 17, 2020
ac2da20
Merge branch 'ha-refactor' of github.com:pingcap/dm into ha-refactor
csuzhangxc Feb 17, 2020
98da039
fix parse problem
lichunzhu Feb 17, 2020
b445e7c
fix parse problem again
lichunzhu Feb 18, 2020
0f5729f
scheduler: support add the same worker multiple times
csuzhangxc Feb 18, 2020
79f0991
add comments and UT for source configss
lichunzhu Feb 18, 2020
d21c2b7
Merge branch 'ha-refactor' of https://github.com/lichunzhu/dm into ha…
lichunzhu Feb 18, 2020
f41ef63
fix UT error
lichunzhu Feb 18, 2020
2783ba6
address comments
lichunzhu Feb 18, 2020
91bc029
Merge branch 'ha-refactor' of https://github.com/pingcap/dm into ha-r…
lichunzhu Feb 18, 2020
6e0c686
fix all_mode integration test, remove disable heartbeat
lichunzhu Feb 18, 2020
e5302d4
comment all enable-heartbeat in integration tests
lichunzhu Feb 18, 2020
d543271
refine dmctl_basic UT
lichunzhu Feb 18, 2020
badb6af
refine dmctl_basic UT part2
lichunzhu Feb 18, 2020
d8e34f5
refine dmctl_basic integration tests part.3
lichunzhu Feb 18, 2020
ead0984
tests: turn off relay_interrupt until compatible with relay again.
csuzhangxc Feb 18, 2020
c7e2296
tests: fix print_status
csuzhangxc Feb 18, 2020
230322c
tests: try to fix ha
csuzhangxc Feb 18, 2020
7899b28
tests: fix initial_unit
csuzhangxc Feb 18, 2020
39e9880
tests: try to fix ha
csuzhangxc Feb 18, 2020
633e6b6
refine dmctl_basic
lichunzhu Feb 18, 2020
10ac6e9
small fi
lichunzhu Feb 18, 2020
2f06e0f
tests: fix incremental_mode; update test_prepare
csuzhangxc Feb 18, 2020
9a05569
fix http_apis test
lichunzhu Feb 18, 2020
8a21f33
Merge branch 'ha-refactor' of https://github.com/pingcap/dm into ha-r…
lichunzhu Feb 18, 2020
894c8f6
sleep after operation in http_apis
lichunzhu Feb 18, 2020
1033e72
disable relay in ha integration tests
lichunzhu Feb 18, 2020
d437200
tests: rename schema in dm_syncer; skip online DDL
csuzhangxc Feb 18, 2020
c7e3abe
refine start_task test
lichunzhu Feb 18, 2020
5b384b3
tests: revert online_ddl case; remove retry_cancel
csuzhangxc Feb 18, 2020
72ade95
Merge remote-tracking branch 'remotes/origin/ha-dev' into ha-refactor
csuzhangxc Feb 18, 2020
15aa987
*: fix merge
csuzhangxc Feb 18, 2020
2d4b386
*: fix merge
csuzhangxc Feb 18, 2020
355c674
tests: abort online_ddl
csuzhangxc Feb 18, 2020
146 changes: 146 additions & 0 deletions pkg/ha/doc.go
@@ -0,0 +1,146 @@
// Copyright 2019 PingCAP, Inc.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// See the License for the specific language governing permissions and
// limitations under the License.

package ha

// Data that needs to be persisted for the HA scheduler:
// - configuration:
//   - the upstream MySQL config (content of `MysqlConfig`):
//     - PUT by DM-master when adding an upstream (`operate-worker create`).
//       - validate the config before PUTting it into etcd.
//     - GET by DM-worker when the source is scheduled to a DM-worker instance.
//     - DELETE by DM-master when removing an upstream (`operate-worker stop`).
//       - DELETE with `the expectant stage of the relay` in one txn.
//       - DELETE with `the bound relationship between the DM-worker instance and the upstream MySQL source` in one txn.
//     - TODO: UPDATE support with `the expectant stage of the relay`.
//   - the data migration task config (content of `TaskConfig`):
//     - PUT by DM-master when starting a task (`start-task`).
//       - validate the config before PUTting it into etcd.
//       - PUT with `the expectant stage of the subtask` in one txn.
//     - GET by DM-worker when starting a subtask.
//     - DELETE by DM-master when stopping a task (`stop-task`).
//       - DELETE with `the expectant stage of the subtask` in one txn.
//     - TODO: UPDATE support with `the expectant stage of the subtask`.
//
// - node information (name, address, etc.):
//   - the DM-worker instance:
//     - PUT by DM-master when adding a DM-worker instance.
//     - GET by the new leader, only when restoring the in-memory information after the DM-master leader has changed.
//     - DELETE by DM-master when removing a DM-worker instance.
//     - TODO: UPDATE support later.
//
// - the health status (or keep-alive) of component instances:
//   - the DM-worker instance:
//     - PUT (keep-alive) by DM-worker (while the node is healthy).
//     - GET (through WATCH) by DM-master to know whether another round of scheduling is needed.
//     - DELETE by etcd when the lease expires (i.e. the node is unhealthy).
//     - no need to UPDATE it manually.
//
// - the running stage:
//   - NOTE: persist the current stage of the relay and subtask if needed later.
//   - the bound relationship between the DM-worker instance and the upstream MySQL source (including relevant relay and subtasks):
//     - PUT by DM-master when scheduling the source to a DM-worker instance.
//       - PUT with `the expectant stage of the relay` in one txn.
//     - GET (through WATCH) by DM-worker to learn which relay/subtasks it has to run.
//     - DELETE by DM-master when removing an upstream.
//       - DELETE with `the upstream MySQL config` in one txn.
//       - DELETE with `the expectant stage of the relay` in one txn.
//     - UPDATE by DM-master when scheduling the source to another DM-worker instance.
//   - the expectant stage of the relay:
//     - PUT by DM-master when scheduling the source to a DM-worker instance.
//       - PUT with `the bound relationship between the DM-worker instance and the upstream MySQL source` in one txn.
//     - GET (through GET/WATCH) by DM-worker to know how to update the current stage.
//     - UPDATE by DM-master when handling a user request (pause-relay/resume-relay).
//     - DELETE by DM-master when removing an upstream.
//       - DELETE with `the upstream MySQL config` in one txn.
//       - DELETE with `the bound relationship between the DM-worker instance and the upstream MySQL source` in one txn.
//   - the expectant stage of the subtask:
//     - PUT/DELETE/UPDATE by DM-master when handling a user request (start-task/stop-task/pause-task/resume-task).
//     - GET (through GET/WATCH) by DM-worker to know how to update the current stage.
//
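The "in one txn" requirement above can be illustrated without etcd. The following is a minimal sketch, not the code in this PR: `kv` is a toy in-memory store standing in for etcd, and the key layout (`/demo/...`, keyed by source ID) and the helper names `bindSource` and `removeSource` are assumptions made up for the example. It only shows why the bound relationship, the relay stage, and the source config are written or removed together.

```go
package main

import (
	"fmt"
	"sync"
)

// kv is a toy in-memory store standing in for etcd (illustration only).
type kv struct {
	mu   sync.Mutex
	data map[string]string
}

func newKV() *kv { return &kv{data: make(map[string]string)} }

// put writes several key/value pairs in one critical section.
func (s *kv) put(pairs map[string]string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for k, v := range pairs {
		s.data[k] = v
	}
}

// deleteAll removes every key in one critical section, but only if guard exists.
// It reports whether the deletion was applied.
func (s *kv) deleteAll(guard string, keys ...string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, ok := s.data[guard]; !ok {
		return false
	}
	for _, k := range keys {
		delete(s.data, k)
	}
	return true
}

// has reports whether key is present.
func (s *kv) has(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.data[key]
	return ok
}

// Hypothetical key layout, keyed by source ID.
func configKey(source string) string     { return "/demo/source-config/" + source }
func relayStageKey(source string) string { return "/demo/relay-stage/" + source }
func boundKey(source string) string      { return "/demo/bound/" + source }

// bindSource writes the bound relationship and the expectant relay stage together.
func bindSource(s *kv, source, worker string) {
	s.put(map[string]string{
		boundKey(source):      worker,
		relayStageKey(source): "Running",
	})
}

// removeSource deletes the config, the relay stage and the bound together,
// so a reader never sees a bound without a config.
func removeSource(s *kv, source string) bool {
	return s.deleteAll(configKey(source),
		configKey(source), relayStageKey(source), boundKey(source))
}

func main() {
	s := newKV()
	s.put(map[string]string{configKey("mysql-01"): "upstream config"})
	bindSource(s, "mysql-01", "worker-1")
	fmt.Println("removed:", removeSource(s, "mysql-01"))
	fmt.Println("bound left behind:", s.has(boundKey("mysql-01")))
}
```

With a real etcd client the same grouping would presumably be expressed as a single transaction; the mutex here only mimics the all-or-nothing visibility.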
// The summary of the above:
// - only the DM-master WRITEs schedule operations.
//   - NOTE: the DM-worker WRITEs (PUTs) its own information and health status.
// - the DM-worker READs schedule operations and obeys them.
// In other words, the behavior of the cluster is clear: all decisions are made by the DM-master.
// As long as the DM-worker can connect to the cluster, it must obey these decisions.
// If the DM-worker can't connect to the cluster, it must shut down all operations.
//
// In this model, we use etcd as the command queue for communication between the DM-master and DM-worker instead of gRPC.
//
// One example of the workflow:
// 0. the user starts the DM-master cluster, which GETs all previously persisted data described above.
//    - restore the in-memory status.
// 1. the user starts a DM-worker instance.
//    - PUT DM-worker instance information into etcd.
// 2. DM-master GETs the information of the DM-worker instance, and marks it as `free`.
// 3. the user adds an upstream config.
//    - PUT the config of the upstream into etcd.
// 4. DM-master schedules the upstream-related operations to the free DM-worker.
//    - PUT the bound relationship.
//    - PUT the expectant stage of the relay if it does not exist.
// 5. DM-worker GETs the bound relationship, the config of the upstream and the expectant stage of the relay.
// 6. DM-worker obeys the expectant stage of the relay.
//    - start relay (if an error occurs, wait for the user to resolve it and do not re-schedule it to other DM-worker instances).
// 7. the user starts a data migration task.
// 8. DM-master PUTs the data migration task config and the expectant stage of subtasks into etcd.
// 9. DM-worker GETs the config of the subtask and the expectant stage of the subtask.
// 10. DM-worker obeys the expectant stage of the subtask.
//    - start the subtask (if an error occurs, wait for the user to resolve it).
// 11. the task keeps running for a period.
// 12. the user pauses the task.
// 13. DM-master PUTs the expectant stage of the subtask.
// 14. DM-worker obeys the expectant stage of the subtask.
// 15. the user resumes the task (DM-master and DM-worker handle it the same way as pausing the task).
// 16. the user stops the task.
// 17. DM-master DELETEs the data migration task config and the expectant stage of subtasks in etcd.
//    - DELETE the information before the subtasks shut down.
// 18. DM-worker stops the subtask.
//    - NOTE: DM-worker should always stop the subtask if the expectant stage of the subtask is missing.
// 19. the relay of the DM-worker continues to run.
// 20. the user removes the upstream config.
// 21. DM-master DELETEs the upstream MySQL config, the bound relationship and the expectant stage of the relay.
// 22. DM-worker shuts down.
// 23. the user marks the DM-worker as offline.
//    - DELETE DM-worker instance information in etcd.
//
// When the DM-worker (with relay and subtasks) is down:
// 0. the status of the old DM-worker becomes unhealthy (keep-alive failed).
// 1. DM-master chooses another DM-worker instance for failover.
// 2. DM-master UPDATEs the bound relationship to point to the new DM-worker.
// 3. the new DM-worker GETs the upstream config, the expectant stage of the relay and the expectant stage of the subtasks.
// 4. the new DM-worker obeys the expectant stages.
//
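The failover steps above can be sketched as a small in-memory routine. This is an illustration under stated assumptions, not the scheduler added in this PR: the `scheduler` type, its fields, and the rule of picking the alphabetically first free worker are all hypothetical, and the sketch keeps the old bound untouched when no free worker exists.

```go
package main

import (
	"fmt"
	"sort"
)

// scheduler is a hypothetical in-memory view held by the DM-master leader.
type scheduler struct {
	alive  map[string]bool   // worker name -> keep-alive key still present
	bounds map[string]string // worker name -> source ID bound to it
}

// failover marks the worker as down and moves its source, if any, to a free
// alive worker. It returns the new worker name and whether a rebind happened.
func (s *scheduler) failover(down string) (string, bool) {
	s.alive[down] = false
	source, ok := s.bounds[down]
	if !ok {
		return "", false // nothing was bound, nothing to move
	}
	free := make([]string, 0, len(s.alive))
	for name, up := range s.alive {
		if _, busy := s.bounds[name]; up && !busy {
			free = append(free, name)
		}
	}
	if len(free) == 0 {
		return "", false // assumption: keep the old record until a worker is free
	}
	sort.Strings(free) // deterministic choice for the example only
	delete(s.bounds, down)
	s.bounds[free[0]] = source
	return free[0], true
}

func main() {
	s := &scheduler{
		alive:  map[string]bool{"worker-1": true, "worker-2": true, "worker-3": true},
		bounds: map[string]string{"worker-1": "mysql-01"},
	}
	next, ok := s.failover("worker-1")
	fmt.Println(next, ok, s.bounds)
}
```

In the model described in this file, the rebind would be persisted as an UPDATE of the bound relationship, and the new DM-worker would pick it up through its WATCH.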
// When the leader of the DM-master cluster changes:
// 0. the old DM-master leader shuts down its operations.
// 1. the new DM-master leader GETs all persisted information to restore the in-memory status.
// 2. the new DM-master leader continues to handle user requests and to schedule upstream sources.
//
// The operations for each expectant stage (for both the relay and subtasks):
// - New:
//   - not a valid expectant stage.
//   - always mark the expectant stage as Running on first creation.
// - Running (schedule the source to the DM-worker, resume-relay or start-task, resume-task):
//   - create and start the relay/subtask instance if it does not exist.
//   - resume it if it is currently Paused.
//   - invalid for other current stages, do nothing.
// - Paused (pause-relay or pause-task):
//   - do nothing if the relay/subtask instance does not exist.
//   - pause it if it is currently Running.
//   - invalid for other current stages, do nothing.
// - Stopped (stop-relay or stop-task):
//   - never stored as an expectant stage in etcd; the relevant information is DELETEd instead.
//   - do nothing if the relay/subtask instance does not exist.
//   - stop it if the relay/subtask instance exists.
// - Finished:
//   - never stored as an expectant stage in etcd; the relevant information is DELETEd instead.
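
The stage rules listed above amount to a small decision table, which can be sketched as a pure function. This is a hedged illustration, not the DM-worker implementation: the `Stage` and `Action` types, the `decide` function, and the use of a boolean to represent "the expectant stage was DELETEd from etcd" are assumptions introduced for the example.

```go
package main

import "fmt"

// Stage mirrors the stage names used in the comment above (hypothetical type).
type Stage int

const (
	New Stage = iota
	Running
	Paused
	Stopped
	Finished
)

// Action is what the DM-worker should do with a relay/subtask instance.
type Action string

const (
	DoNothing Action = "nothing"
	Start     Action = "create-and-start"
	Resume    Action = "resume"
	Pause     Action = "pause"
	Stop      Action = "stop"
)

// decide maps the expectant stage onto an action.
//   - expectFound is false when the expectant stage has been DELETEd.
//   - exists tells whether a local relay/subtask instance exists.
//   - current is only meaningful when exists is true.
func decide(expect Stage, expectFound bool, current Stage, exists bool) Action {
	// A missing (or Stopped) expectant stage means: stop whatever is running.
	if !expectFound || expect == Stopped {
		if exists {
			return Stop
		}
		return DoNothing
	}
	switch expect {
	case Running:
		if !exists {
			return Start
		}
		if current == Paused {
			return Resume
		}
	case Paused:
		if exists && current == Running {
			return Pause
		}
	}
	// New and Finished are not valid expectant stages; other combinations are invalid too.
	return DoNothing
}

func main() {
	fmt.Println(decide(Running, true, New, false))     // no instance yet
	fmt.Println(decide(Paused, true, Running, true))   // running instance
	fmt.Println(decide(Running, false, Running, true)) // expectant stage deleted
}
```

Keeping the decision free of side effects makes it easy to check each rule in isolation; the caller would then apply the returned action to the relay or subtask.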