This repository has been archived by the owner on Nov 24, 2023. It is now read-only.

DM may lost data when resume from error in non GTID mode #1751

Closed
lance6716 opened this issue Jun 5, 2021 · 1 comment · Fixed by #1752
Labels: affected-v2.0.2, affected-v2.0.3, severity/critical, type/bug

Comments


lance6716 commented Jun 5, 2021

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do? If possible, provide a recipe for reproducing the error.

The upstream uses non-GTID replication and generates binlog events that each contain more than one row change. Suppose DM is handling event e1: if DM hits an error and is then resumed, some row changes of e1 may be lost.

  2. Explanation.

DM splits each rows event and generates one job per row change.

binlog event with 2 row changes
              │
      ┌───────▼───────┐
      │handleRowsEvent│
      └───────┬───────┘
              ▼
         job1, job2

The 2 jobs above carry the same event position p0 and the same table information t0. DM invokes addJob for the jobs sequentially.
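
The splitting step can be sketched as follows. This is a minimal illustration in Python, not DM's Go code: `Job`, `split_rows_event`, the `(file, offset)` position tuple, and the sample file name are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical position type: (binlog file name, offset within the file).
Position = Tuple[str, int]


@dataclass(frozen=True)
class Job:
    table: str          # table information (t0 in the text)
    position: Position  # position of the originating binlog event (p0)
    row: dict           # a single row change


def split_rows_event(table: str, position: Position, rows: List[dict]) -> List[Job]:
    """Create one job per row change; every job inherits the event's position."""
    return [Job(table, position, row) for row in rows]


# One binlog event with 2 row changes becomes job1 and job2.
jobs = split_rows_event("t0", ("mysql-bin.000001", 1000), [{"id": 1}, {"id": 2}])
```

Because both jobs share one position, a checkpoint recorded at that position cannot say which of the two rows has actually been applied.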

In addJob, DM will

  1. send the job to a random worker
  2. check whether this job needs a flush
  3. if a flush is needed, wait for the workers to finish all jobs
  4. save the checkpoint for the table
  5. if a flush is needed, flush the checkpoints
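
The five steps above can be modelled with a toy class. This is a sketch under assumptions, not DM's implementation: `Syncer`, its fields, and the `need_flush` argument (which stands in for the check in step 2) are hypothetical, and the workers are reduced to plain lists.

```python
from typing import Dict, List, Tuple

# Hypothetical position type: (binlog file name, offset within the file).
Position = Tuple[str, int]


class Syncer:
    """Toy model of the five addJob steps; not DM's real implementation."""

    def __init__(self) -> None:
        self.queued: List[dict] = []                # jobs handed to workers, not yet done
        self.executed: List[dict] = []              # jobs the workers have finished
        self.checkpoints: Dict[str, Position] = {}  # in-memory checkpoint per table
        self.flushed: Dict[str, Position] = {}      # persisted checkpoints

    def add_job(self, table: str, position: Position, row: dict, need_flush: bool) -> None:
        self.queued.append(row)                     # 1. send the job to a worker
        if need_flush:                              # 2. does this job need a flush?
            self.executed.extend(self.queued)       # 3. wait for workers to finish all jobs
            self.queued.clear()
        self.checkpoints[table] = position          # 4. save the checkpoint for the table
        if need_flush:
            self.flushed = dict(self.checkpoints)   # 5. flush the checkpoints


p0 = ("mysql-bin.000001", 1000)
s = Syncer()
s.add_job("t0", p0, {"id": 1}, need_flush=True)   # job1 triggers the flush
s.add_job("t0", p0, {"id": 2}, need_flush=False)  # job2 is only queued
```

After these two calls the persisted checkpoint for t0 is already p0, while job2 is still waiting in the queue.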

Now suppose job1 needs a flush (for example, because the 30s flush interval has been reached). After step 5, the checkpoint for table t0 is flushed with p0. If the SQL of job2 then fails to execute, DM will, after resuming, skip t0's binlog events whose position <= p0. So job2 is lost.
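
The effect of the resume check can be illustrated with a small simulation. It assumes positions compare as `(file, offset)` tuples and that binlog file names have equal width so that string ordering matches file ordering. `should_skip` and the sample data are hypothetical and only mirror the `position <= p0` rule described above.

```python
from typing import Tuple

# Hypothetical position type: (binlog file name, offset within the file).
Position = Tuple[str, int]


def should_skip(event_pos: Position, flushed_pos: Position) -> bool:
    """On resume, skip an event whose position is <= the flushed checkpoint."""
    return event_pos <= flushed_pos


p0 = ("mysql-bin.000001", 1000)
flushed_checkpoint = p0              # flushed while handling job1
applied_before_error = [{"id": 1}]   # job1 succeeded, job2 failed

# After resuming, event e1 (at p0, carrying both rows) is read again.
event_rows = [{"id": 1}, {"id": 2}]
replayed = [] if should_skip(p0, flushed_checkpoint) else event_rows
applied = applied_before_error + replayed
```

The whole event is skipped because its position equals the flushed checkpoint, so the row for job2 never reaches `applied`.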

lance6716 added the affected-v2.0.2, affected-v2.0.3, severity/critical, and type/bug labels Jun 5, 2021
lance6716 changed the title from "DM may lost data when resume from synchronization error in non GTID mode" to "DM may lost data when resume from error in non GTID mode" Jun 5, 2021
lance6716 commented

In version 1.0, replication from the relay unit to the sync unit always uses non-GTID mode.
