syncer task is paused after connection recover #11384

Open
giant-panda666 opened this issue Jul 3, 2024 · 8 comments
Comments

@giant-panda666

What did you do?

The connection from dm-worker to the downstream TiDB was interrupted, and the error was [error="[code=10006:class=database:scope=not-set:level=high], Message: execute statement failed: begin, RawCause: invalid connection"]. After the network failure was fixed, the paused task was resumed. However, the insert statement had already been executed while the checkpoint was not updated, and dm-worker did not enter auto-safe mode, so the insert statement was executed again. Sadly, the syncing task paused again because of a duplicated primary key.

What did you expect to see?

After the network failure was fixed, the paused task should be resumed, and it should be able to handle the duplicated key error.

What did you see instead?

The syncing task paused again.

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

v7.5.1

Upstream MySQL/MariaDB server version:

MySQL 5.7

Downstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

v7.5.1

How did you deploy DM: tiup or manually?

TiDB Operator

Other interesting information (system version, hardware config, etc):


current status of DM cluster (execute query-status <task-name> in dmctl)

(paste current status of DM cluster here)
@giant-panda666 added the labels area/dm (Issues or PRs related to DM) and type/bug (The issue is confirmed as a bug) on Jul 3, 2024
@lance6716
Contributor

Can you provide a log that shows dm-worker didn't enter auto-safe mode?

@giant-panda666
Author

worker.log

Well, there are many errors in the log, but I can recall the whole process: it ran from 2024/07/05 13:02 to 2024/07/05 13:19. At 2024/07/05 13:02, the k8s node through which this worker was connected to the TiDB server started draining. At 2024/07/05 13:19 the k8s node recovered and began accepting pod scheduling again. I found the worker was paused, so I ran the resume-task command to resume the task.

@giant-panda666
Author

From the source code:

1. Execute the SQL in dm-worker:
   `affect, err = db.ExecuteSQL(ctx, w.metricProxies, queries, args...)`
2. Execute the SQL in dbconn:
   `return conn.ExecuteSQLWithIgnore(tctx, metricProxies, nil, queries, args...)`
3. Execute the SQL with ignored errors:
   `ret, _, err := conn.baseConn.ApplyRetryStrategy(`

So if the statement in fact succeeded but the connection turned into a bad connection, the statement will be retried because the error is treated as retryable. The retry then fails if the applied statement is an insert statement.

@lance6716
Contributor

I need the full log to understand what happened, especially for the period between resuming the task and meeting the "Duplicate entry" error around 2024/07/05 13:02:48.251 +00:00.

When the task resumes, the safe mode exit point should enable safe mode to cover the partially replicated binlog events.

s.tctx.L().Info("enable safe-mode for safe mode exit point, will exit at", zap.Stringer("location", *exitPoint))
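As a hedged illustration of what safe mode buys here (genDML is a made-up helper, not DM's actual API): in safe mode the same row change is rendered as REPLACE rather than INSERT, so replaying a partially replicated event is idempotent.

```go
package main

import "fmt"

// genDML sketches the effect of safe mode on generated DML: the same
// binlog event becomes REPLACE instead of INSERT, so executing it
// twice does not raise a duplicate key error.
func genDML(safeMode bool, table string) string {
	if safeMode {
		return "REPLACE INTO " + table + " VALUES (?)" // idempotent on replay
	}
	return "INSERT INTO " + table + " VALUES (?)" // duplicate key on replay
}

func main() {
	fmt.Println(genDML(true, "t"))  // REPLACE INTO t VALUES (?)
	fmt.Println(genDML(false, "t")) // INSERT INTO t VALUES (?)
}
```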

@giant-panda666
Author

That is the full log; because there were so many logs, I changed the log level to error (it seems not to work in go-mysql). Maybe "After network failure was fixed, the paused task is resumed." above is wrong (I'm not so familiar with the source code...); it should be that the connection was broken and then dbConn retried, which caused the duplicated key.

@lance6716
Contributor

> it should be that the connection was broken and then dbConn retried, which caused the duplicated key.

Oh I understand, will check it later.

@lance6716
Contributor

worker.log

From the log file you provided, the error is "invalid connection", not "bad connection".

And DM will not retry sending the query for "invalid connection", so I still need the full log to locate the problem.

@giant-panda666
Author

There are no extra logs; the log level is error and that's all. I have changed safe-mode to true, and the problem is not reproduced any more.
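For anyone hitting the same issue, here is a hypothetical task-file excerpt showing where that setting lives (field names follow DM's task configuration layout; the source name is illustrative, and other required fields are omitted):

```yaml
# illustrative excerpt of a DM task file
syncers:
  global:
    safe-mode: true        # rewrite INSERT to REPLACE so replays are idempotent
mysql-instances:
  - source-id: "mysql-01"  # hypothetical source name
    syncer-config-name: "global"
```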
