This repository has been archived by the owner on Nov 24, 2023. It is now read-only.

Make DM become HA #406

Open
12 of 23 tasks
csuzhangxc opened this issue Dec 6, 2019 · 4 comments
Labels
type/feature-request This issue is a feature request

Comments

@csuzhangxc
Member

csuzhangxc commented Dec 6, 2019

Overview

  • Make DM become HA, which includes making it tolerant to process crashes, machine failures, network partitioning and more.
  • Combine DM-master into DM-worker as one binary file.

Problem Statement

  • The current DM cluster deployment has only one DM-master process. When the DM-master process fails, task management, cluster management, and shard DDL coordination can no longer be performed.
  • The current DM cluster deployment has only one DM-worker process for each upstream MySQL/MariaDB instance. When that DM-worker process fails, data migration from the corresponding upstream MySQL/MariaDB instance is interrupted.
  • For some metadata (task configuration, cluster version, etc.), there is only a single local copy on the DM-worker node. If that local copy is damaged, the data migration task is difficult to recover.

Proposed Solution

  • Combine DM-master into DM-worker to reduce cluster components (Let's still call it DM-worker below).
  • Embed etcd into each DM-worker process as a cluster for storing metadata and leader election.
  • Elect a leader among the DM-worker instances, based on the etcd cluster, to handle user requests (from dmctl or other HTTP clients, just as the previous DM-master did). If the original leader fails, a new leader is automatically elected. Note that the entire DM cluster has at most one leader at a time (let's call it the leader below).
  • Support deploying one or more DM-worker instances for each upstream MySQL/MariaDB instance and electing one instance to run the related data migration subtasks. If the original instance fails, a new instance is automatically elected to take over the subtasks. Note that for each upstream MySQL/MariaDB instance, at most one DM-worker instance runs its data migration subtasks at any time. One example:
    • for MySQL-A, two DM-workers (DM-worker-1 and DM-worker-2) exist. Then we need to elect DM-worker-1 or DM-worker-2 to run the subtasks, and we mark the elected DM-worker as the running instance and others as idle instances.
    • for MySQL-B, only one DM-worker (DM-worker-3) exists. Then we elect DM-worker-3 to run the subtasks.
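The per-upstream election rule in the example above can be sketched in Go (DM's implementation language). This is a minimal in-memory illustration only: the `Source` type and `elect` function are names invented for this sketch, and the real design would decide the running instance through an etcd election rather than by picking the first worker.

```go
package main

import "fmt"

// Source models one upstream MySQL/MariaDB instance and the DM-worker
// instances deployed for it. These names are illustrative only; the
// real design elects through etcd.
type Source struct {
	Name    string
	Workers []string
}

// elect marks one worker as the running instance for a source and the
// rest as idle, mirroring the at-most-one-running rule described above.
// Here we simply pick the first worker; in DM, etcd's election decides.
func elect(s Source) (running string, idle []string) {
	if len(s.Workers) == 0 {
		return "", nil
	}
	return s.Workers[0], s.Workers[1:]
}

func main() {
	sources := []Source{
		{Name: "MySQL-A", Workers: []string{"DM-worker-1", "DM-worker-2"}},
		{Name: "MySQL-B", Workers: []string{"DM-worker-3"}},
	}
	for _, s := range sources {
		running, idle := elect(s)
		fmt.Printf("%s: running=%s idle=%v\n", s.Name, running, idle)
	}
}
```

For MySQL-A this marks DM-worker-1 as running and DM-worker-2 as idle; for MySQL-B, DM-worker-3 is running with no idle instances.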

Success Criteria

  • If multiple DM-worker instances are deployed, user requests can still be handled as long as more than half of the instances are healthy.
  • As long as more than half of the DM-worker instances are available, metadata reads and writes are not affected.
  • As long as at least one DM-worker instance is available for an upstream MySQL/MariaDB instance, data migration from that upstream is not affected.
  • A deployment with only one DM-worker instance can still handle data migration tasks when there is only one upstream MySQL/MariaDB instance.
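The "more than half" criteria above follow from etcd's majority-quorum rule. A quick sketch of the arithmetic (the function names are mine, but the formulas are the standard etcd ones):

```go
package main

import "fmt"

// quorum returns the minimum number of live members an n-member etcd
// cluster needs to make progress (more than half), and faultTolerance
// the number of member failures it can survive while keeping quorum.
func quorum(n int) int         { return n/2 + 1 }
func faultTolerance(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("%d members: quorum=%d, tolerates %d failures\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

This is also why odd cluster sizes are preferred: a 4-member cluster tolerates no more failures than a 3-member one.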

Difficulty

  • Hard

Score

  • 6000

Mentor(s)

TODO list

  • Embed etcd into each DM-master process as a cluster for storing metadata and leader election.
  • Build an API-server or operation queue based on the etcd cluster, partially done in master: use etcd as operate queue #363.
  • Elect a DM-worker to run upstream related data migration subtasks, partially done in dm-worker: support election with same source-id #399.
  • Register DM-worker dynamically for specified upstream, partially done in *: register DM-worker instance to DM-master dynamically #394.
  • Task scheduling
    • Manage (add, delete, update) upstream MySQL related information in DM-master.
    • Register DM-worker instance into DM-master.
    • Redirect API requests from DM-master followers to the leader.
    • Remove the old subtask operation queue in DM-worker.
    • Refresh the lease while a DM-worker is running subtasks.
    • Reschedule the data migration subtask when the DM-worker is abnormal.
    • Retry task operations against multiple DM-masters in dmctl.
    • Schedule the data migration subtask based on topology.
  • Combine DM-master and DM-worker into one binary.
    • Move API server from DM-master into DM-worker.
    • Move the scheduler from DM-master into DM-worker.
    • Update startup behavior to support electing master role in DM-worker instances.
    • Promote DM-worker instances from worker/idle role to master role automatically/manually.
    • Automatically refresh the list of master-role instances in DM-workers after it changes.
    • Hot update some configuration items (like log level) in DM-worker.
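The rescheduling and promotion items above ("Reschedule the data migration subtask when the DM-worker is abnormal", "Promote DM-worker instances from worker/idle role") can be sketched as a toy in-memory function. All names here are hypothetical, and the real design would trigger the promotion from etcd lease expiry rather than an explicit failure argument:

```go
package main

import "fmt"

// reschedule promotes the first idle worker to running when the
// current running worker for an upstream becomes abnormal. A toy
// in-memory sketch of the rescheduling TODO item; in the real design,
// etcd lease expiry would signal the failure.
func reschedule(running string, idle []string, failed string) (string, []string) {
	if running != failed {
		return running, idle // the failed instance was not the running one
	}
	if len(idle) == 0 {
		return "", nil // no replacement available; subtasks pause
	}
	return idle[0], idle[1:]
}

func main() {
	running, idle := "DM-worker-1", []string{"DM-worker-2"}
	running, idle = reschedule(running, idle, "DM-worker-1")
	fmt.Println("running:", running, "idle:", idle)
}
```

If an idle instance fails instead, nothing changes; if the running instance fails with no idle instances left, migration for that upstream pauses until a worker rejoins.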

Design

References

@csuzhangxc csuzhangxc added the type/feature-request This issue is a feature request label Dec 6, 2019
@Little-Wallace
Contributor

I want to join in this task~

@nolouch
Member

nolouch commented Dec 9, 2019

I want to join in this task~

@IANTHEREAL
Collaborator

IANTHEREAL commented Dec 11, 2019

@csuzhangxc @nolouch you can link the conclusion documents from the discussion here. Don't worry that the documents are in Chinese; some companies or communities may be interested in them.

@csuzhangxc
Member Author

> @csuzhangxc @nolouch you can link the conclusion documents from the discussion here. Don't worry that the documents are in Chinese; some companies or communities may be interested in them.

@GregoryIan I've linked the document in the Design section above and updated the TODO list too.
