DM uses a wrong GTID set when doing finer-grained retry #3711

Closed
lance6716 opened this issue Dec 3, 2021 · 2 comments · Fixed by #4130
Labels
area/dm (Issues or PRs related to DM) · severity/critical · type/bug (The issue is confirmed as a bug)

Comments

lance6716 (Contributor) commented Dec 3, 2021

What did you do?

Upstream has a GTID transaction of binlog events:

  1. GTID event
  2. Query event (BEGIN)
  3. ...

After upstream sends event 2, lastLocation becomes the GTID set that includes this GTID, in https://github.com/pingcap/ticdc/blob/20626babf21fc381d4364646c40dd84598533d66/dm/syncer/syncer.go#L2126-L2135

Now upstream raises a network error, so getEvent returns an error:

https://github.com/pingcap/ticdc/blob/20626babf21fc381d4364646c40dd84598533d66/dm/syncer/syncer.go#L1590

Now DM will do a finer-grained retry.

What did you expect to see?

no data loss

What did you see instead?

Data is lost because the retry uses lastLocation:

https://github.com/pingcap/ticdc/blob/20626babf21fc381d4364646c40dd84598533d66/dm/syncer/syncer.go#L1622-L1624

However, as described above (the comment in that code is wrong), lastLocation is already the GTID set that includes this GTID, so resuming from it means upstream will not send this GTID again.
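To make the failure mode concrete, here is a minimal Go sketch (hypothetical types and names, not the actual DM syncer code) of how resuming from an already-advanced lastLocation drops the in-flight transaction:

```go
// A minimal sketch, assuming a simplified GTID-set representation; the real
// syncer code linked above is more involved.
package main

import "fmt"

type gtidSet map[string]int64 // server UUID -> last executed transaction number, simplified

type location struct {
	gtids gtidSet
}

func main() {
	// Before the transaction, upstream GTID set ends at uuid:100.
	lastLocation := location{gtids: gtidSet{"uuid": 100}}

	// Steps 1-2: DM receives the GTID event and the BEGIN query event, and
	// (per the code linked above) advances lastLocation to include uuid:101
	// before the transaction's row events have been applied downstream.
	lastLocation.gtids["uuid"] = 101

	// Step 3: a network error interrupts getEvent before the row events arrive.

	// Finer-grained retry: the syncer re-subscribes starting from lastLocation.
	// Because lastLocation already contains uuid:101, upstream treats that
	// transaction as executed and never sends its row events again -> data loss.
	fmt.Printf("resume from GTID set %v: transaction uuid:101 is never replayed\n", lastLocation.gtids)
}
```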

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

v5.3.0

Upstream MySQL/MariaDB server version:

(paste upstream MySQL/MariaDB server version here)

Downstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

(paste TiDB cluster version here)

How did you deploy DM: tiup or manually?

(leave TiUP or manually here)

Other interesting information (system version, hardware config, etc):


current status of DM cluster (execute query-status <task-name> in dmctl)

(paste current status of DM cluster here)
lance6716 added the type/bug and area/dm labels on Dec 3, 2021
lance6716 (Contributor, Author) commented:

@cyliu0

lance6716 (Contributor, Author) commented Dec 29, 2021

After #3860, the problem has changed to the following:

We now use an earlier location to reset the upstream replication, so we will not lose data. But we may replicate duplicate data and thus hit a "Duplicate entry" error. For GTID-based replication, we use maybeSkipNRowsEvent to skip the duplicate events, but for position-based replication we have no way to protect against it.

I'll port #3860 to the master branch as an ugly fix, which always turns on safe mode for one transaction. After the ugly fix I'll change this issue to a feature request and wait for a better fix.
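For illustration, a rough sketch (not DM's actual safe-mode implementation) of why forcing safe mode for the first transaction after a retry tolerates duplicates: with safe mode on, inserts are generated as REPLACE INTO, so re-applying a row that already exists downstream overwrites it instead of failing with "Duplicate entry".

```go
// A rough sketch, assuming a hypothetical genInsertSQL helper; it only shows
// the idea of safe-mode SQL generation, not DM's real DML builder.
package main

import "fmt"

func genInsertSQL(table string, safeMode bool) string {
	verb := "INSERT INTO"
	if safeMode {
		verb = "REPLACE INTO" // idempotent when the row was already applied downstream
	}
	return fmt.Sprintf("%s %s (id, v) VALUES (?, ?)", verb, table)
}

func main() {
	// First transaction after a finer-grained retry: force safe mode on.
	fmt.Println(genInsertSQL("t1", true)) // REPLACE INTO t1 (id, v) VALUES (?, ?)

	// Subsequent transactions: back to normal mode.
	fmt.Println(genInsertSQL("t1", false)) // INSERT INTO t1 (id, v) VALUES (?, ?)
}
```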

The better fix has two candidates:

  • after syncer(dm): don't check flush inside an event and refactor #3891, for position-based replication the table checkpoint comparison can use <= table checkpoint rather than < table checkpoint to skip events, so there are no duplicate events once we use <= (see the sketch after this list).
  • we also use maybeSkipNRowsEvent no matter whether it's position-based or GTID-based replication.
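A simplified sketch of the first candidate (hypothetical types, not the real checkpoint code): for position-based replication, skipping events whose location is <= the table checkpoint, instead of strictly <, prevents re-applying the event recorded at the checkpoint itself after a retry.

```go
// A simplified sketch, assuming a minimal binlog position type; the real
// checkpoint comparison in DM covers more fields (GTID, suffix, etc.).
package main

import "fmt"

type binlogPos struct {
	name string // binlog file name
	pos  uint32 // offset within the file
}

// less reports whether a is strictly before b in the binlog stream.
func (a binlogPos) less(b binlogPos) bool {
	if a.name != b.name {
		return a.name < b.name
	}
	return a.pos < b.pos
}

// shouldSkip decides whether an event at eventPos has already been applied,
// given the table checkpoint; inclusive toggles the <= vs < behavior.
func shouldSkip(eventPos, checkpoint binlogPos, inclusive bool) bool {
	if inclusive {
		// candidate fix: skip when eventPos <= checkpoint
		return !checkpoint.less(eventPos)
	}
	// current behavior: skip only when eventPos < checkpoint
	return eventPos.less(checkpoint)
}

func main() {
	cp := binlogPos{"mysql-bin.000002", 1234}
	ev := binlogPos{"mysql-bin.000002", 1234} // same event re-sent after the retry

	fmt.Println(shouldSkip(ev, cp, false)) // false: event is applied again -> duplicate
	fmt.Println(shouldSkip(ev, cp, true))  // true: event is skipped -> no duplicate
}
```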

cc @GMHDBJD @glorv
