DM uses a wrong GTID set when doing finer-grained retry #3711

Closed
lance6716 opened this issue Dec 3, 2021 · 2 comments · Fixed by #4130
Labels
area/dm (Issues or PRs related to DM) · severity/critical · type/bug (The issue is confirmed as a bug)

Comments

lance6716 (Contributor) commented Dec 3, 2021

What did you do?

Upstream has a GTID transaction of binlog events:

  1. GTID event
  2. Query event (BEGIN)
  3. ...

After upstream sends event 2, lastLocation becomes the GTID set that includes this GTID, in https://github.com/pingcap/ticdc/blob/20626babf21fc381d4364646c40dd84598533d66/dm/syncer/syncer.go#L2126-L2135

Now upstream raises a network error, so getEvent returns an error:

https://github.com/pingcap/ticdc/blob/20626babf21fc381d4364646c40dd84598533d66/dm/syncer/syncer.go#L1590

Now DM will do a finer-grained retry.

What did you expect to see?

no data loss

What did you see instead?

Data is lost because the retry uses lastLocation:

https://github.com/pingcap/ticdc/blob/20626babf21fc381d4364646c40dd84598533d66/dm/syncer/syncer.go#L1622-L1624

However, as described above (the comment in that code is wrong), lastLocation is already the GTID set that includes this GTID, so resuming from it means upstream will not send this GTID again.
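To make the failure mode concrete, here is a minimal Go sketch (hypothetical types and names, not the actual DM syncer code) of how resuming from an already-advanced lastLocation drops the in-flight transaction:

```go
// A minimal sketch, assuming a simplified GTID-set representation; the real
// syncer code linked above is more involved.
package main

import "fmt"

type gtidSet map[string]int64 // server UUID -> last executed transaction number, simplified

type location struct {
	gtids gtidSet
}

func main() {
	// Before the transaction, upstream GTID set ends at uuid:100.
	lastLocation := location{gtids: gtidSet{"uuid": 100}}

	// Steps 1-2: DM receives the GTID event and the BEGIN query event, and
	// (per the code linked above) advances lastLocation to include uuid:101
	// before the transaction's row events have been applied downstream.
	lastLocation.gtids["uuid"] = 101

	// Step 3: a network error interrupts getEvent before the row events arrive.

	// Finer-grained retry: the syncer re-subscribes starting from lastLocation.
	// Because lastLocation already contains uuid:101, upstream treats that
	// transaction as executed and never sends its row events again -> data loss.
	fmt.Printf("resume from GTID set %v: transaction uuid:101 is never replayed\n", lastLocation.gtids)
}
```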

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

v5.3.0

Upstream MySQL/MariaDB server version:

(paste upstream MySQL/MariaDB server version here)

Downstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)

How did you deploy DM: tiup or manually?

(leave TiUP or manually here)

Other interesting information (system version, hardware config, etc):


current status of DM cluster (execute query-status <task-name> in dmctl)

(paste current status of DM cluster here)
lance6716 added the type/bug and area/dm labels on Dec 3, 2021
lance6716 (Contributor, Author) commented:

@cyliu0

lance6716 (Contributor, Author) commented Dec 29, 2021

After #3860, the problem has changed to the following:

We now use an earlier location to reset the upstream replication, so we will not lose data. But we may replicate duplicate data and thus hit a "Duplicate entry" error. For GTID-based replication, we use maybeSkipNRowsEvent to skip the duplicate events, but for position-based replication we have no way to protect against it.

I'll port #3860 to the master branch as an ugly fix, which always turns on safe mode for one transaction. After the ugly fix I'll change this issue to a feature request and wait for a better fix.
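For illustration, a rough sketch (not DM's actual safe-mode implementation) of why forcing safe mode for the first transaction after a retry tolerates duplicates: with safe mode on, inserts are generated as REPLACE INTO, so re-applying a row that already exists downstream overwrites it instead of failing with "Duplicate entry".

```go
// A rough sketch, assuming a hypothetical genInsertSQL helper; it only shows
// the idea of safe-mode SQL generation, not DM's real DML builder.
package main

import "fmt"

func genInsertSQL(table string, safeMode bool) string {
	verb := "INSERT INTO"
	if safeMode {
		verb = "REPLACE INTO" // idempotent when the row was already applied downstream
	}
	return fmt.Sprintf("%s %s (id, v) VALUES (?, ?)", verb, table)
}

func main() {
	// First transaction after a finer-grained retry: force safe mode on.
	fmt.Println(genInsertSQL("t1", true)) // REPLACE INTO t1 (id, v) VALUES (?, ?)

	// Subsequent transactions: back to normal mode.
	fmt.Println(genInsertSQL("t1", false)) // INSERT INTO t1 (id, v) VALUES (?, ?)
}
```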

The better fix has two candidates:

  • after syncer(dm): don't check flush inside an event and refactor #3891, for position-based replication the table checkpoint comparison can use <= table checkpoint rather than < table checkpoint to skip events, so there are no duplicate events once we use <= (see the sketch after this list).
  • we also use maybeSkipNRowsEvent no matter whether it's position-based or GTID-based replication.
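A simplified sketch of the first candidate (hypothetical types, not the real checkpoint code): for position-based replication, skipping events whose location is <= the table checkpoint, instead of strictly <, prevents re-applying the event recorded at the checkpoint itself after a retry.

```go
// A simplified sketch, assuming a minimal binlog position type; the real
// checkpoint comparison in DM covers more fields (GTID, suffix, etc.).
package main

import "fmt"

type binlogPos struct {
	name string // binlog file name
	pos  uint32 // offset within the file
}

// less reports whether a is strictly before b in the binlog stream.
func (a binlogPos) less(b binlogPos) bool {
	if a.name != b.name {
		return a.name < b.name
	}
	return a.pos < b.pos
}

// shouldSkip decides whether an event at eventPos has already been applied,
// given the table checkpoint; inclusive toggles the <= vs < behavior.
func shouldSkip(eventPos, checkpoint binlogPos, inclusive bool) bool {
	if inclusive {
		// candidate fix: skip when eventPos <= checkpoint
		return !checkpoint.less(eventPos)
	}
	// current behavior: skip only when eventPos < checkpoint
	return eventPos.less(checkpoint)
}

func main() {
	cp := binlogPos{"mysql-bin.000002", 1234}
	ev := binlogPos{"mysql-bin.000002", 1234} // same event re-sent after the retry

	fmt.Println(shouldSkip(ev, cp, false)) // false: event is applied again -> duplicate
	fmt.Println(shouldSkip(ev, cp, true))  // true: event is skipped -> no duplicate
}
```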

cc @GMHDBJD @glorv
