
ddl: limit the number of DDL job retries #7474

Closed
ciscoxll wants to merge 2 commits

Conversation

@ciscoxll (Contributor) commented Aug 23, 2018

What problem does this PR solve?

This PR solves the problem of infinite retry when a DDL job error occurs.

What is changed and how it works?

  • If a DDL job fails, it is currently retried indefinitely, so we need to limit the number of retries (outlined in the sketch after this list).
  • Avoid failing jobs blocking the whole DDL queue.
  • Add some logging and metrics.
  • Make the retry limit configurable through a system variable, e.g. set @@global.tidb_ddl_error_retry_limit = 10000.
  • Fixes issue #7517: limit the number of retries when a DDL job error occurs.
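
In outline, the final revision adds a bounded-retry check to the error path of (*worker).runDDLJob. The Go sketch below is assembled from the diff hunks quoted later in this review; it is not the complete error handling, and the surrounding code follows the PR:

if err != nil {
	// If the job is not cancelled, log the error together with the job query.
	if job.State != model.JobStateCancelled {
		log.Errorf("[ddl-%s] run DDL job err %v, job query %s", w, errors.ErrorStack(err), job.Query)
	}

	job.Error = toTError(err)
	job.ErrorCount++
	// Once the error count exceeds the configurable limit, stop retrying and
	// move the job to the cancelling state; add-index jobs are excluded.
	if job.ErrorCount > int64(variable.GetDDLErrorRetryLimit()) && job.Type != model.ActionAddIndex {
		log.Infof("[ddl-%s] DDL job over maximum retry count is canceled because %v, job query %s", w, errors.ErrorStack(err), job.Query)
		job.State = model.JobStateCancelling
		metrics.DDLWorkerHistogram.WithLabelValues(metrics.WorkerCancelDDLJob, job.Type.String(), metrics.RetLabel(err)).Observe(time.Since(model.TSConvert2Time(job.StartTS)).Seconds())
	}
}

With this in place, an administrator can raise or lower the limit through the tidb_ddl_error_retry_limit system variable mentioned above.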

Check List

Tests

  • Unit test

PTAL @tiancaiamao @winkyao @zimulala.



@@ -521,6 +522,54 @@ func (s *testDDLSuite) TestBuildJobDependence(c *C) {
})
}

func (s *testDDLSuite) TestErrorCountlimit(c *C) {
Contributor

Please put it in the fail test file.

@@ -460,6 +463,7 @@ func (w *worker) runDDLJob(d *ddlCtx, t *meta.Meta, job *model.Job) (ver int64,
job.State = model.JobStateCancelled
job.Error = errCancelledDDLJob
job.ErrorCount++
metrics.DDLWorkerHistogram.WithLabelValues(metrics.WorkerCancelDDLJob, metrics.RetLabel(err)).Observe(time.Since(startTime).Seconds())
Contributor

Is this metric valuable?

@@ -51,6 +51,8 @@ const (
waitDependencyJobInterval = 200 * time.Millisecond
// noneDependencyJob means a job has no dependency-job.
noneDependencyJob = 0
// errorCountlimit limit the number of retries.
errorCountlimit = 10000
Contributor

10000 is too large, maybe 512 is enough.

Contributor

Could we add a config item or system variable to set it?

} else {
log.Infof("[ddl-%s] the DDL job is normal to cancel because %v", w, errors.ErrorStack(err))
log.Infof("[ddl-%s] the DDL job is normal to cancel because %v job query %s", w, errors.ErrorStack(err), job.Query)
Contributor

Nit: add a comma before "job query" in the log format, i.e. "..., job query %s".

@@ -519,13 +523,18 @@ func (w *worker) runDDLJob(d *ddlCtx, t *meta.Meta, job *model.Job) (ver int64,
if err != nil {
// If job is not cancelled, we should log this error.
if job.State != model.JobStateCancelled {
log.Errorf("[ddl-%s] run DDL job err %v", w, errors.ErrorStack(err))
log.Errorf("[ddl-%s] run DDL job err %v job query %s ", w, errors.ErrorStack(err), job.Query)
Contributor

Nit: add a comma before "job query" in the log format, i.e. "..., job query %s".

}

job.Error = toTError(err)
job.ErrorCount++
if job.ErrorCount > errorCountlimit {
job.State = model.JobStateCancelling
metrics.DDLWorkerHistogram.WithLabelValues(metrics.WorkerCancelDDLJob, metrics.RetLabel(err)).Observe(time.Since(startTime).Seconds())
Contributor

Is startTime the time the job started executing, or the start time of the job's last execution attempt?

@@ -51,6 +51,8 @@ const (
waitDependencyJobInterval = 200 * time.Millisecond
// noneDependencyJob means a job has no dependency-job.
noneDependencyJob = 0
// errorCountlimit limit the number of retries.
errorCountlimit = 512
Contributor

Make this configurable.

Contributor Author

@winkyao PTAL.

// SetDDLErrorRetryLimit sets ddlErrorRetryLimit count.
func SetDDLErrorRetryLimit(cnt int32) {
if cnt < minDDLErrorRetryLimit {
cnt = ddlErrorRetryLimit
Contributor

Why not set it to minDDLErrorRetryLimit?
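
For illustration, the variant the reviewer appears to be asking for would clamp the value to the minimum rather than falling back to ddlErrorRetryLimit (a hypothetical sketch, not the code in this PR; the body after the clamp is elided in the quoted diff, so it is only indicated by a comment):

// SetDDLErrorRetryLimit sets ddlErrorRetryLimit count.
func SetDDLErrorRetryLimit(cnt int32) {
	if cnt < minDDLErrorRetryLimit {
		// Hypothetical: use the minimum as the floor instead of ddlErrorRetryLimit.
		cnt = minDDLErrorRetryLimit
	}
	// ... store cnt as in the original implementation (not shown in the quoted diff)
}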

@@ -223,4 +223,8 @@ func (s *testVarsutilSuite) TestVarsutil(c *C) {
c.Assert(err, IsNil)
c.Assert(val, Equals, "1")
c.Assert(v.EnableTablePartition, IsTrue)

c.Assert(GetDDLErrorRetryLimit(), Equals, int32(DefTiDBDDLErrorRetryLimit))
Contributor

Add more test cases, such as setting -1 for DefTiDBDDLErrorRetryLimit. Please make your unit tests cover more cases.
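
For example, assertions along these lines could be added to the varsutil test (a sketch only; the value stored for out-of-range input is an assumption and depends on how SetDDLErrorRetryLimit clamps it, which is still being discussed above):

// Hypothetical extra cases; helper names follow the snippets quoted in this PR.
SetDDLErrorRetryLimit(1024)
c.Assert(GetDDLErrorRetryLimit(), Equals, int32(1024))

// A negative value such as -1 should not be stored as-is; it should be clamped
// to a positive limit (the exact resulting value is an assumption).
SetDDLErrorRetryLimit(-1)
c.Assert(GetDDLErrorRetryLimit() > 0, IsTrue)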

Contributor Author

@winkyao Done.

}

job.Error = toTError(err)
job.ErrorCount++
if job.ErrorCount > int64(variable.GetDDLErrorRetryLimit()) {
startTime := time.Now()
Contributor

Why is startTime not job.StartTS?

@@ -519,13 +520,19 @@ func (w *worker) runDDLJob(d *ddlCtx, t *meta.Meta, job *model.Job) (ver int64,
if err != nil {
// If job is not cancelled, we should log this error.
if job.State != model.JobStateCancelled {
log.Errorf("[ddl-%s] run DDL job err %v", w, errors.ErrorStack(err))
log.Errorf("[ddl-%s] run DDL job err %v, job query %s ", w, errors.ErrorStack(err), job.Query)
Contributor

Please add the WorkerCancelDDLJob metric here.

Contributor Author

@zimulala Done.

@@ -214,6 +217,7 @@ const (
DefTiDBDDLReorgWorkerCount = 16
DefTiDBHashAggPartialConcurrency = 4
DefTiDBHashAggFinalConcurrency = 4
DefTiDBDDLErrorRetryLimit = 512
Contributor

Will this value be too small?

}

job.Error = toTError(err)
job.ErrorCount++
if job.ErrorCount > int64(variable.GetDDLErrorRetryLimit()) {
Contributor

Do we need to add a log here?

@@ -293,6 +293,9 @@ type SessionVars struct {
EnableStreaming bool

writeStmtBufs WriteStmtBufs

// DDLErrorRetryLimit limits the number of error retries that occur in ddl job.
DDLErrorRetryLimit int64
Contributor

Is this variable useful?

@ciscoxll force-pushed the retry-limit branch 5 times, most recently from b6d83fa to 1f19c1b on August 30, 2018 06:58
@ciscoxll (Contributor Author) commented Sep 3, 2018

/run-all-tests

@ciscoxll (Contributor Author) commented Sep 4, 2018

@crazycs520 @winkyao @zimulala PTAL.

@ciscoxll (Contributor Author) commented Sep 5, 2018

@crazycs520 @winkyao @zimulala PTAL.

}

job.Error = toTError(err)
job.ErrorCount++
if job.ErrorCount > int64(variable.GetDDLErrorRetryLimit()) && job.Type != model.ActionAddIndex {
log.Infof("[ddl-%s] DDL job over maximum retry count is canceled because %v, job query %s", w, errors.ErrorStack(err), job.Query)
job.State = model.JobStateCancelling
Contributor

Is there a problem with the following scenario: an "add column" operation has already advanced to the "delete only" state, and then we cancel this job?

Contributor Author

@zimulala I wrote a test that cancels this job, and there was no error.

Contributor

@ciscoxll
If a "drop column" job has already advanced the column to the "delete only" state and then hits many errors, the job is cancelled successfully and we return an error to the client, but the column's state remains "delete only".
I think this will be a problem.

Contributor Author

@zimulala I will write a rollback in another PR.

@zimulala (Contributor) commented Feb 13, 2019

Handled in #9295.

@ciscoxll (Contributor Author) commented Sep 6, 2018

@winkyao PTAL.

if job.ErrorCount > int64(variable.GetDDLErrorRetryLimit()) && job.Type != model.ActionAddIndex {
log.Infof("[ddl-%s] DDL job over maximum retry count is canceled because %v, job query %s", w, errors.ErrorStack(err), job.Query)
job.State = model.JobStateCancelling
metrics.DDLWorkerHistogram.WithLabelValues(metrics.WorkerCancelDDLJob, job.Type.String(), metrics.RetLabel(err)).Observe(time.Since(model.TSConvert2Time(job.StartTS)).Seconds())
Contributor

Is this a repeat of the metric at line 529?

@winkyao closed this on Nov 23, 2018
@sre-bot added the contribution label (This PR is from a community contributor.) on Dec 18, 2019
@you06 added the sig/sql-infra label (SIG: SQL Infra) on Mar 4, 2020