ddl: limit the number of DDL job retries #7474
Conversation
ddl/ddl_worker_test.go
Outdated
@@ -521,6 +522,54 @@ func (s *testDDLSuite) TestBuildJobDependence(c *C) {
	})
}

func (s *testDDLSuite) TestErrorCountlimit(c *C) {
Please put this test in the fail test file.
ddl/ddl_worker.go
Outdated
@@ -460,6 +463,7 @@ func (w *worker) runDDLJob(d *ddlCtx, t *meta.Meta, job *model.Job) (ver int64,
		job.State = model.JobStateCancelled
		job.Error = errCancelledDDLJob
		job.ErrorCount++
		metrics.DDLWorkerHistogram.WithLabelValues(metrics.WorkerCancelDDLJob, metrics.RetLabel(err)).Observe(time.Since(startTime).Seconds())
Is this metric valuable?
ddl/ddl_worker.go
Outdated
@@ -51,6 +51,8 @@ const (
	waitDependencyJobInterval = 200 * time.Millisecond
	// noneDependencyJob means a job has no dependency-job.
	noneDependencyJob = 0
	// errorCountlimit limit the number of retries.
	errorCountlimit = 10000
10000 is too large, maybe 512 is enough.
Could we add a config item or system variable to set it?
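A minimal sketch of how such a system variable could be wired up. This mirrors the SetDDLErrorRetryLimit/GetDDLErrorRetryLimit helpers that appear later in this PR; the atomic storage and the exact file layout are assumptions for illustration, not necessarily the final implementation.

package variable

import "sync/atomic"

// ddlErrorRetryLimit holds the current retry limit. The DDL worker reads it on
// every retry, so an atomic is used instead of a mutex (assumed design).
var ddlErrorRetryLimit int32 = DefTiDBDDLErrorRetryLimit

// GetDDLErrorRetryLimit returns the current DDL error retry limit.
func GetDDLErrorRetryLimit() int32 {
	return atomic.LoadInt32(&ddlErrorRetryLimit)
}

// SetDDLErrorRetryLimit updates the limit, e.g. when a user runs
// `set @@global.tidb_ddl_error_retry_limit = 10000`.
func SetDDLErrorRetryLimit(cnt int32) {
	atomic.StoreInt32(&ddlErrorRetryLimit, cnt)
}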
ddl/ddl_worker.go
Outdated
} else {
	log.Infof("[ddl-%s] the DDL job is normal to cancel because %v", w, errors.ErrorStack(err))
	log.Infof("[ddl-%s] the DDL job is normal to cancel because %v job query %s", w, errors.ErrorStack(err), job.Query)
Missing a comma: it should be `, job query %s`.
ddl/ddl_worker.go
Outdated
@@ -519,13 +523,18 @@ func (w *worker) runDDLJob(d *ddlCtx, t *meta.Meta, job *model.Job) (ver int64,
	if err != nil {
		// If job is not cancelled, we should log this error.
		if job.State != model.JobStateCancelled {
			log.Errorf("[ddl-%s] run DDL job err %v", w, errors.ErrorStack(err))
			log.Errorf("[ddl-%s] run DDL job err %v job query %s ", w, errors.ErrorStack(err), job.Query)
Missing a comma here as well: `, job query %s`.
ddl/ddl_worker.go
Outdated
}

	job.Error = toTError(err)
	job.ErrorCount++
	if job.ErrorCount > errorCountlimit {
		job.State = model.JobStateCancelling
		metrics.DDLWorkerHistogram.WithLabelValues(metrics.WorkerCancelDDLJob, metrics.RetLabel(err)).Observe(time.Since(startTime).Seconds())
Is `startTime` the time the job started executing, or the start time of the last execution attempt of the job?
ddl/ddl_worker.go
Outdated
@@ -51,6 +51,8 @@ const (
	waitDependencyJobInterval = 200 * time.Millisecond
	// noneDependencyJob means a job has no dependency-job.
	noneDependencyJob = 0
	// errorCountlimit limit the number of retries.
	errorCountlimit = 512
Make this configurable.
@winkyao PTAL.
sessionctx/variable/varsutil.go
Outdated
// SetDDLErrorRetryLimit sets ddlErrorRetryLimit count.
func SetDDLErrorRetryLimit(cnt int32) {
	if cnt < minDDLErrorRetryLimit {
		cnt = ddlErrorRetryLimit
Why not set to minDDLErrorRetryLimit?
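The change the reviewer is suggesting would look roughly like this (sketch only; whether to fall back to the minimum or to the default is exactly the open question here):

	if cnt < minDDLErrorRetryLimit {
		cnt = minDDLErrorRetryLimit // clamp to the documented minimum instead of resetting to the default
	}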
@@ -223,4 +223,8 @@ func (s *testVarsutilSuite) TestVarsutil(c *C) {
	c.Assert(err, IsNil)
	c.Assert(val, Equals, "1")
	c.Assert(v.EnableTablePartition, IsTrue)

	c.Assert(GetDDLErrorRetryLimit(), Equals, int32(DefTiDBDDLErrorRetryLimit))
Add more test cases, e.g. setting -1 and checking that it falls back to DefTiDBDDLErrorRetryLimit. Please make your unit tests cover more cases.
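A sketch of the kind of extra coverage being asked for, assuming out-of-range values fall back to the default as SetDDLErrorRetryLimit above is meant to ensure (the exact fallback value depends on how that function is finally written):

	// A negative limit must not be stored as-is; it should fall back to the default.
	SetDDLErrorRetryLimit(-1)
	c.Assert(GetDDLErrorRetryLimit(), Equals, int32(DefTiDBDDLErrorRetryLimit))

	// A value above the default should round-trip unchanged.
	SetDDLErrorRetryLimit(10000)
	c.Assert(GetDDLErrorRetryLimit(), Equals, int32(10000))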
@winkyao Done.
ddl/ddl_worker.go
Outdated
}

	job.Error = toTError(err)
	job.ErrorCount++
	if job.ErrorCount > int64(variable.GetDDLErrorRetryLimit()) {
		startTime := time.Now()
Why is `startTime` not `job.StartTS`?
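For context, the two candidates being discussed (both appear elsewhere in this conversation) measure different things, so the choice changes what the histogram records:

	// Wall-clock start of the current retry attempt (what the diff above records).
	startTime := time.Now()

	// The job's original start timestamp, converted from its TSO; the last
	// diff in this conversation observes the duration since this point.
	jobStart := model.TSConvert2Time(job.StartTS)
	metrics.DDLWorkerHistogram.WithLabelValues(metrics.WorkerCancelDDLJob, metrics.RetLabel(err)).Observe(time.Since(jobStart).Seconds())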
@@ -519,13 +520,19 @@ func (w *worker) runDDLJob(d *ddlCtx, t *meta.Meta, job *model.Job) (ver int64,
	if err != nil {
		// If job is not cancelled, we should log this error.
		if job.State != model.JobStateCancelled {
			log.Errorf("[ddl-%s] run DDL job err %v", w, errors.ErrorStack(err))
			log.Errorf("[ddl-%s] run DDL job err %v, job query %s ", w, errors.ErrorStack(err), job.Query)
Please add metrics(WorkerCancelDDLJob) here.
@zimulala Done.
sessionctx/variable/tidb_vars.go
Outdated
@@ -214,6 +217,7 @@ const (
	DefTiDBDDLReorgWorkerCount = 16
	DefTiDBHashAggPartialConcurrency = 4
	DefTiDBHashAggFinalConcurrency = 4
	DefTiDBDDLErrorRetryLimit = 512
Will this value be too small?
ddl/ddl_worker.go
Outdated
}

	job.Error = toTError(err)
	job.ErrorCount++
	if job.ErrorCount > int64(variable.GetDDLErrorRetryLimit()) {
Do we need to add a log here?
sessionctx/variable/session.go
Outdated
@@ -293,6 +293,9 @@ type SessionVars struct {
	EnableStreaming bool

	writeStmtBufs WriteStmtBufs

	// DDLErrorRetryLimit limits the number of error retries that occur in ddl job.
	DDLErrorRetryLimit int64
Is this variable useful?
b6d83fa to 1f19c1b (Compare)
/run-all-tests
@crazycs520 @winkyao @zimulala PTAL.
}

	job.Error = toTError(err)
	job.ErrorCount++
	if job.ErrorCount > int64(variable.GetDDLErrorRetryLimit()) && job.Type != model.ActionAddIndex {
		log.Infof("[ddl-%s] DDL job over maximum retry count is canceled because %v, job query %s", w, errors.ErrorStack(err), job.Query)
		job.State = model.JobStateCancelling
Is there a problem with the following scenario?
If an "add column" operation has already advanced to the "delete only" state and then we cancel this job?
@zimulala I wrote a test that cancels this job, and no error occurred.
@ciscoxll
If a "drop column" job has already updated the column to the "delete only" state and then hits too many errors, the job is cancelled successfully and we return an error to the client, but the column's state remains "delete only".
I think this will be a problem.
@zimulala I will write a rollback in another PR.
Handled in #9295.
@winkyao PTAL.
if job.ErrorCount > int64(variable.GetDDLErrorRetryLimit()) && job.Type != model.ActionAddIndex {
	log.Infof("[ddl-%s] DDL job over maximum retry count is canceled because %v, job query %s", w, errors.ErrorStack(err), job.Query)
	job.State = model.JobStateCancelling
	metrics.DDLWorkerHistogram.WithLabelValues(metrics.WorkerCancelDDLJob, job.Type.String(), metrics.RetLabel(err)).Observe(time.Since(model.TSConvert2Time(job.StartTS)).Seconds())
Is this a repeat of line #529?
What problem does this PR solve?
This PR solves the problem of infinite retries when a DDL job keeps hitting errors.
What is changed and how it works?
- metrics
- set @@global.tidb_ddl_error_retry_limit = 10000
Check List
Tests
PTAL @tiancaiamao @winkyao @zimulala.