This repository has been archived by the owner on Nov 24, 2023. It is now read-only.

Make DM become HA #406

Open
12 of 23 tasks
csuzhangxc opened this issue Dec 6, 2019 · 4 comments
Labels
type/feature-request This issue is a feature request

Comments

@csuzhangxc
Member

csuzhangxc commented Dec 6, 2019

Overview

  • Make DM become HA, which includes making it tolerant to process crashes, machine failures, network partitioning and more.
  • Combine DM-master into DM-worker as one binary file.

Problem Statement

  • The current DM cluster deployment has only one DM-master process. When the DM-master process fails, task management, cluster management, and shard DDL coordination can no longer be performed.
  • The current DM cluster deployment has only one DM-worker process for each upstream MySQL/MariaDB instance. When that DM-worker process fails, data migration from the corresponding upstream MySQL/MariaDB instance is interrupted.
  • For some metadata (task configuration, cluster version, etc.), there is only a single local copy on the DM-worker node. If that local copy is damaged, the data migration task is difficult to recover.

Proposed Solution

  • Combine DM-master into DM-worker to reduce cluster components (Let's still call it DM-worker below).
  • Embed etcd into each DM-worker process as a cluster for storing metadata and leader election.
  • Elect a leader among the DM-worker instances, based on the etcd cluster, to handle user requests (from dmctl or other HTTP clients, just as the previous DM-master did). If the original leader fails, a new leader is automatically elected. Note that the entire DM cluster has at most one leader at a time (let's call it the leader below).
  • Support deploying one or more DM-worker instances for each upstream MySQL/MariaDB instance and electing one instance to run the related data migration subtasks. If the original instance fails, a new instance is automatically elected to take over the subtasks. Note that for each upstream MySQL/MariaDB instance, at most one DM-worker instance runs its data migration subtasks at any time. One example:
    • for MySQL-A, two DM-workers (DM-worker-1 and DM-worker-2) exist. Then we need to elect DM-worker-1 or DM-worker-2 to run the subtasks, and we mark the elected DM-worker as the running instance and others as idle instances.
    • for MySQL-B, only one DM-worker (DM-worker-3) exists. Then we elect DM-worker-3 to run the subtasks.
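The per-upstream election rule in the example above can be sketched in Go (DM's implementation language). This is a minimal in-memory illustration only: the `Source` type and `elect` function are names invented for this sketch, and the real design would decide the running instance through an etcd election rather than by picking the first worker.

```go
package main

import "fmt"

// Source models one upstream MySQL/MariaDB instance and the DM-worker
// instances deployed for it. These names are illustrative only; the
// real design elects through etcd.
type Source struct {
	Name    string
	Workers []string
}

// elect marks one worker as the running instance for a source and the
// rest as idle, mirroring the at-most-one-running rule described above.
// Here we simply pick the first worker; in DM, etcd's election decides.
func elect(s Source) (running string, idle []string) {
	if len(s.Workers) == 0 {
		return "", nil
	}
	return s.Workers[0], s.Workers[1:]
}

func main() {
	sources := []Source{
		{Name: "MySQL-A", Workers: []string{"DM-worker-1", "DM-worker-2"}},
		{Name: "MySQL-B", Workers: []string{"DM-worker-3"}},
	}
	for _, s := range sources {
		running, idle := elect(s)
		fmt.Printf("%s: running=%s idle=%v\n", s.Name, running, idle)
	}
}
```

For MySQL-A this marks DM-worker-1 as running and DM-worker-2 as idle; for MySQL-B, DM-worker-3 is running with no idle instances.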

Success Criteria

  • If multiple DM-worker instances are deployed, user requests can still be handled as long as more than half of the instances are healthy.
  • As long as more than half of the DM-worker instances are available, metadata reads and writes are not affected.
  • As long as at least one DM-worker instance is available for an upstream MySQL/MariaDB instance, data migration from that upstream is not affected.
  • A deployment with only one DM-worker instance can still handle data migration tasks when there is only one upstream MySQL/MariaDB instance.
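The "more than half" criteria above follow from etcd's majority-quorum rule. A quick sketch of the arithmetic (the function names are mine, but the formulas are the standard etcd ones):

```go
package main

import "fmt"

// quorum returns the minimum number of live members an n-member etcd
// cluster needs to make progress (more than half), and faultTolerance
// the number of member failures it can survive while keeping quorum.
func quorum(n int) int         { return n/2 + 1 }
func faultTolerance(n int) int { return (n - 1) / 2 }

func main() {
	for _, n := range []int{1, 3, 5} {
		fmt.Printf("%d members: quorum=%d, tolerates %d failures\n",
			n, quorum(n), faultTolerance(n))
	}
}
```

This is also why odd cluster sizes are preferred: a 4-member cluster tolerates no more failures than a 3-member one.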

Difficulty

  • Hard

Score

  • 6000

Mentor(s)

TODO list

  • Embed etcd into each DM-master process as a cluster for storing metadata and leader election.
  • Build an API-server or operation queue based on the etcd cluster, partially done in master: use etcd as operate queue #363.
  • Elect a DM-worker to run upstream related data migration subtasks, partially done in dm-worker: support election with same source-id #399.
  • Register DM-worker dynamically for specified upstream, partially done in *: register DM-worker instance to DM-master dynamically #394.
  • Task scheduling
    • Manage (add, delete, update) upstream MySQL related information in DM-master.
    • Register DM-worker instance into DM-master.
    • Redirect API requests from DM-master followers to the leader.
    • Remove the old subtask operation queue in DM-worker.
    • Refresh the lease while a DM-worker is running subtasks.
    • Reschedule the data migration subtask when the DM-worker is abnormal.
    • Retry task operations against multiple DM-masters in dmctl.
    • Schedule the data migration subtask based on topology.
  • Combine DM-master and DM-worker into one binary.
    • Move API server from DM-master into DM-worker.
    • Move the scheduler from DM-master into DM-worker.
    • Update startup behavior to support electing master role in DM-worker instances.
    • Promote DM-worker instances from worker/idle role to master role automatically/manually.
    • Automatically refresh the list of master-role instances in DM-workers after it changes.
    • Hot update some configuration items (like log level) in DM-worker.
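The rescheduling and promotion items above ("Reschedule the data migration subtask when the DM-worker is abnormal", "Promote DM-worker instances from worker/idle role") can be sketched as a toy in-memory function. All names here are hypothetical, and the real design would trigger the promotion from etcd lease expiry rather than an explicit failure argument:

```go
package main

import "fmt"

// reschedule promotes the first idle worker to running when the
// current running worker for an upstream becomes abnormal. A toy
// in-memory sketch of the rescheduling TODO item; in the real design,
// etcd lease expiry would signal the failure.
func reschedule(running string, idle []string, failed string) (string, []string) {
	if running != failed {
		return running, idle // the failed instance was not the running one
	}
	if len(idle) == 0 {
		return "", nil // no replacement available; subtasks pause
	}
	return idle[0], idle[1:]
}

func main() {
	running, idle := "DM-worker-1", []string{"DM-worker-2"}
	running, idle = reschedule(running, idle, "DM-worker-1")
	fmt.Println("running:", running, "idle:", idle)
}
```

If an idle instance fails instead, nothing changes; if the running instance fails with no idle instances left, migration for that upstream pauses until a worker rejoins.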

Design

References

@csuzhangxc csuzhangxc added the type/feature-request This issue is a feature request label Dec 6, 2019
@Little-Wallace
Contributor

I want to join in this task~

@nolouch
Member

nolouch commented Dec 9, 2019

I want to join in this task~

@IANTHEREAL
Collaborator

IANTHEREAL commented Dec 11, 2019

@csuzhangxc @nolouch you can link the conclusion documents from the discussion here. Don't worry that the documents are in Chinese; some companies or communities may be interested in them.

@csuzhangxc
Member Author

> @csuzhangxc @nolouch you can link the conclusion documents from the discussion here. Don't worry that the documents are in Chinese; some companies or communities may be interested in them.

@GregoryIan I've linked the document in the Design section above and updated the TODO list too.
