syncer task is paused after connection recover #11384

Open
giant-panda666 opened this issue Jul 3, 2024 · 8 comments
Comments

@giant-panda666

What did you do?

The connection from dm-worker to the downstream TiDB was interrupted, and the error was [error="[code=10006:class=database:scope=not-set:level=high], Message: execute statement failed: begin, RawCause: invalid connection"]. After the network failure was fixed, the paused task was resumed. However, the insert statement had already been executed while the checkpoint was not updated, and dm-worker did not enter auto-safe mode, so the insert statement was executed again. Sadly, the syncing task paused again because of a duplicated primary key.

What did you expect to see?

After the network failure was fixed, the paused task should be resumed, and it should be able to handle the duplicated key error.

What did you see instead?

The syncing task paused again.

Versions of the cluster

DM version (run dmctl -V or dm-worker -V or dm-master -V):

v7.5.1

Upstream MySQL/MariaDB server version:

MySQL 5.7

Downstream TiDB cluster version (execute SELECT tidb_version(); in a MySQL client):

v7.5.1

How did you deploy DM: tiup or manually?

TiDB Operator

Other interesting information (system version, hardware config, etc):


current status of DM cluster (execute query-status <task-name> in dmctl)

(paste current status of DM cluster here)
@giant-panda666 added the labels area/dm (Issues or PRs related to DM) and type/bug (The issue is confirmed as a bug) on Jul 3, 2024
@lance6716
Contributor

Can you provide a log that shows dm-worker didn't enter auto-safe mode?

@giant-panda666
Author

worker.log

Well, there are many errors in the log, but I can recall the whole process: it ran from 2024/07/05 13:02 to 2024/07/05 13:19. At 2024/07/05 13:02, the k8s node through which this worker was connected to the TiDB server started draining. At 2024/07/05 13:19 the k8s node recovered and began accepting pod scheduling again. I found the worker was paused, so I ran the resume-task command to resume the task.

@giant-panda666
Author

From the source code:

1. Execute the SQL in dm-worker:
   `affect, err = db.ExecuteSQL(ctx, w.metricProxies, queries, args...)`
2. Execute the SQL in dbconn:
   `return conn.ExecuteSQLWithIgnore(tctx, metricProxies, nil, queries, args...)`
3. Execute the SQL with ignored errors:
   `ret, _, err := conn.baseConn.ApplyRetryStrategy(`

So if the statement in fact succeeded but the connection turned into a bad connection, the statement will be retried because the error is treated as retryable. The retry then fails if the applied statement is an insert statement.

@lance6716
Contributor

I need the full log to understand what happened, especially for the period between resuming the task and meeting the "Duplicate entry" error around 2024/07/05 13:02:48.251 +00:00.

When the task resumes, the safe mode exit point should enable safe mode to cover the partially replicated binlog events.

s.tctx.L().Info("enable safe-mode for safe mode exit point, will exit at", zap.Stringer("location", *exitPoint))
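As a hedged illustration of what safe mode buys here (genDML is a made-up helper, not DM's actual API): in safe mode the same row change is rendered as REPLACE rather than INSERT, so replaying a partially replicated event is idempotent.

```go
package main

import "fmt"

// genDML sketches the effect of safe mode on generated DML: the same
// binlog event becomes REPLACE instead of INSERT, so executing it
// twice does not raise a duplicate key error.
func genDML(safeMode bool, table string) string {
	if safeMode {
		return "REPLACE INTO " + table + " VALUES (?)" // idempotent on replay
	}
	return "INSERT INTO " + table + " VALUES (?)" // duplicate key on replay
}

func main() {
	fmt.Println(genDML(true, "t"))  // REPLACE INTO t VALUES (?)
	fmt.Println(genDML(false, "t")) // INSERT INTO t VALUES (?)
}
```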

@giant-panda666
Author

That is the full log; because there were so many logs, I changed the log level to error (it seems not to work in go-mysql). Maybe "After network failure was fixed, the paused task is resumed." above is wrong (I'm not so familiar with the source code...); it should be that the connection was broken and then dbConn retried, which caused the duplicated key.

@lance6716
Contributor

> it should be that the connection was broken and then dbConn retried, which caused the duplicated key.

Oh I understand, will check it later.

@lance6716
Contributor

worker.log

From the log file you provided, the error is "invalid connection", not "bad connection".

And DM will not retry sending the query for "invalid connection", so I still need the full log to locate the problem.

@giant-panda666
Author

There are no extra logs; the log level is error and that's all. I have changed safe-mode to true, and the problem is not reproduced any more.
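For anyone hitting the same issue, here is a hypothetical task-file excerpt showing where that setting lives (field names follow DM's task configuration layout; the source name is illustrative, and other required fields are omitted):

```yaml
# illustrative excerpt of a DM task file
syncers:
  global:
    safe-mode: true        # rewrite INSERT to REPLACE so replays are idempotent
mysql-instances:
  - source-id: "mysql-01"  # hypothetical source name
    syncer-config-name: "global"
```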
