Restart unhealthy tasks #3105

Merged (33 commits, Sep 18, 2017)
Changes from 7 commits

Commits (33):
a720bb5  Add restart fields (schmichael, Aug 24, 2017)
bd1a342  Nest restart fields in CheckRestart (schmichael, Aug 25, 2017)
1608e59  Add check watcher for restarting unhealthy tasks (schmichael, Aug 26, 2017)
ebbf87f  Use existing restart policy infrastructure (schmichael, Sep 10, 2017)
555d1e2  on_warning=false -> ignore_warnings=false (schmichael, Sep 11, 2017)
c2d895d  Add comments and move delay calc to TaskRunner (schmichael, Sep 11, 2017)
78c72f8  Default grace period to 1s (schmichael, Sep 11, 2017)
7e103f6  Document new check_restart stanza (schmichael, Sep 11, 2017)
850d991  Add changelog entry for #3105 (schmichael, Sep 11, 2017)
3db835c  Improve check watcher logging and add tests (schmichael, Sep 13, 2017)
526528c  Removed partially implemented allocLock (schmichael, Sep 13, 2017)
568b963  Remove unused lastStart field (schmichael, Sep 13, 2017)
9fb2865  Fix whitespace (schmichael, Sep 13, 2017)
092057a  Canonicalize and Merge CheckRestart in api (schmichael, Sep 14, 2017)
8b8c164  Wrap check watch updates in a struct (schmichael, Sep 14, 2017)
237c096  Simplify from 2 select loops to one (schmichael, Sep 14, 2017)
f8e872c  RestartDelay isn't needed as checks are re-added on restarts (schmichael, Sep 14, 2017)
40ed262  Handle multiple failing checks on a single task (schmichael, Sep 14, 2017)
5cd1d57  Watched -> TriggersRestart (schmichael, Sep 14, 2017)
10dc1c7  DRY up restart handling a bit. (schmichael, Sep 14, 2017)
3c0a42b  Rename unhealthy var and fix test indeterminism (schmichael, Sep 14, 2017)
6f72270  Test check watch updates (schmichael, Sep 14, 2017)
a508bb9  Fold SetFailure into SetRestartTriggered (schmichael, Sep 14, 2017)
5141c95  Add check_restart to jobspec tests (schmichael, Sep 14, 2017)
1564e1c  Move check_restart to its own section. (schmichael, Sep 14, 2017)
8014762  Add comments (schmichael, Sep 15, 2017)
cde908e  Cleanup and test restart failure code (schmichael, Sep 15, 2017)
fa836d8  Name const after what it represents (schmichael, Sep 15, 2017)
924813d  Test converting CheckRestart from api->structs (schmichael, Sep 15, 2017)
6bcf019  Test CheckRestart.Validate (schmichael, Sep 15, 2017)
10ae18c  Minor corrections to check_restart docs (schmichael, Sep 15, 2017)
967825d  Fix comments: task -> check (schmichael, Sep 15, 2017)
3d7446d  @dadgar is better at words than me (schmichael, Sep 15, 2017)

api/tasks.go (2 changes: 2 additions & 0 deletions)

@@ -187,6 +187,8 @@ func (s *Service) Canonicalize(t *Task, tg *TaskGroup, job *Job) {

s.CheckRestart.Canonicalize()

// Canonicalize CheckRestart on Checks and merge Service.CheckRestart
// into each check.
for _, c := range s.Checks {

c.CheckRestart.Canonicalize()
c.CheckRestart = c.CheckRestart.Merge(s.CheckRestart)
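
The `c.CheckRestart = c.CheckRestart.Merge(s.CheckRestart)` call above is where service-level `check_restart` settings flow into each check. `Merge` itself is not shown in this 7-commit view, so here is a minimal standalone sketch of the likely precedence rule (check-level values win, service-level values only fill in fields the check left unset), using the `Limit`/`Grace`/`IgnoreWarnings` fields that appear in the test diffs further down; the real `api.CheckRestart` may differ in detail.

```go
package main

import (
	"fmt"
	"time"
)

// CheckRestart is an illustrative stand-in for api.CheckRestart with the
// fields exercised by the job_endpoint tests below.
type CheckRestart struct {
	Limit          int
	Grace          *time.Duration
	IgnoreWarnings bool
}

// Merge sketches the documented precedence: the check-level (receiver) values
// take priority and service-level settings only fill in unset fields.
func (c *CheckRestart) Merge(o *CheckRestart) *CheckRestart {
	if c == nil && o == nil {
		return nil
	}
	if c == nil {
		nc := *o // no check-level settings: inherit the service-level ones
		return &nc
	}
	nc := *c // copy so neither input is mutated
	if o == nil {
		return &nc
	}
	if nc.Limit == 0 {
		nc.Limit = o.Limit
	}
	if nc.Grace == nil {
		nc.Grace = o.Grace
	}
	if !nc.IgnoreWarnings {
		nc.IgnoreWarnings = o.IgnoreWarnings
	}
	return &nc
}

func main() {
	grace := 10 * time.Second
	service := &CheckRestart{Limit: 3, Grace: &grace, IgnoreWarnings: true}
	check := &CheckRestart{Limit: 1} // the check only overrides the limit
	merged := check.Merge(service)
	fmt.Println(merged.Limit, *merged.Grace, merged.IgnoreWarnings) // 1 10s true
}
```

One caveat of this sketch: an intentional zero `Limit` on the check cannot be distinguished from "unset", which is presumably why `Grace` is a pointer in the api package.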

client/restarts.go (63 changes: 34 additions & 29 deletions)

@@ -74,7 +74,9 @@ func (r *RestartTracker) SetWaitResult(res *dstructs.WaitResult) *RestartTracker
}

// SetRestartTriggered is used to mark that the task has been signalled to be
// restarted
// restarted. Setting the failure to true restarts according to the restart
// policy. When failure is false the task is restarted without considering the
// restart policy.
func (r *RestartTracker) SetRestartTriggered(failure bool) *RestartTracker {

Review comment (Contributor): Comment on the param

r.lock.Lock()
defer r.lock.Unlock()
@@ -143,39 +145,42 @@ func (r *RestartTracker) GetState() (string, time.Duration) {
}

// Handle restarts due to failures
if r.failure {
if r.startErr != nil {
// If the error is not recoverable, do not restart.
if !structs.IsRecoverable(r.startErr) {
r.reason = ReasonUnrecoverableErrror
return structs.TaskNotRestarting, 0
}
} else if r.waitRes != nil {
// If the task started successfully and restart on success isn't specified,
// don't restart but don't mark as failed.
if r.waitRes.Successful() && !r.onSuccess {
r.reason = "Restart unnecessary as task terminated successfully"
return structs.TaskTerminated, 0
}
}
if !r.failure {
return "", 0
}

if r.count > r.policy.Attempts {
if r.policy.Mode == structs.RestartPolicyModeFail {
r.reason = fmt.Sprintf(
`Exceeded allowed attempts %d in interval %v and mode is "fail"`,
r.policy.Attempts, r.policy.Interval)
return structs.TaskNotRestarting, 0
} else {
r.reason = ReasonDelay
return structs.TaskRestarting, r.getDelay()
}
if r.startErr != nil {
// If the error is not recoverable, do not restart.
if !structs.IsRecoverable(r.startErr) {
r.reason = ReasonUnrecoverableErrror
return structs.TaskNotRestarting, 0
}
} else if r.waitRes != nil {
// If the task started successfully and restart on success isn't specified,
// don't restart but don't mark as failed.
if r.waitRes.Successful() && !r.onSuccess {
r.reason = "Restart unnecessary as task terminated successfully"
return structs.TaskTerminated, 0
}
}

r.reason = ReasonWithinPolicy
return structs.TaskRestarting, r.jitter()
// If this task has been restarted due to failures more times
// than the restart policy allows within an interval fail
// according to the restart policy's mode.
if r.count > r.policy.Attempts {
if r.policy.Mode == structs.RestartPolicyModeFail {
r.reason = fmt.Sprintf(
`Exceeded allowed attempts %d in interval %v and mode is "fail"`,
r.policy.Attempts, r.policy.Interval)
return structs.TaskNotRestarting, 0
} else {
r.reason = ReasonDelay
return structs.TaskRestarting, r.getDelay()
}
}

return "", 0
r.reason = ReasonWithinPolicy
return structs.TaskRestarting, r.jitter()
}

// getDelay returns the delay time to enter the next interval.

client/restarts_test.go (13 changes: 13 additions & 0 deletions)

@@ -104,6 +104,19 @@ func TestClient_RestartTracker_RestartTriggered(t *testing.T) {
}
}

func TestClient_RestartTracker_RestartTriggered_Failure(t *testing.T) {
t.Parallel()
p := testPolicy(true, structs.RestartPolicyModeFail)
p.Attempts = 1
rt := newRestartTracker(p, structs.JobTypeService)
if state, when := rt.SetRestartTriggered(true).GetState(); state != structs.TaskRestarting || when == 0 {
t.Fatalf("expect restart got %v %v", state, when)
}
if state, when := rt.SetRestartTriggered(true).GetState(); state != structs.TaskNotRestarting || when != 0 {
t.Fatalf("expect failed got %v %v", state, when)
}
}

func TestClient_RestartTracker_StartError_Recoverable_Fail(t *testing.T) {
t.Parallel()
p := testPolicy(true, structs.RestartPolicyModeFail)

client/task_runner.go (4 changes: 2 additions & 2 deletions)

@@ -789,8 +789,8 @@ OUTER:
return
}
case structs.VaultChangeModeRestart:
const failure = false
r.Restart("vault", "new Vault token acquired", failure)
const noFailure = false
r.Restart("vault", "new Vault token acquired", noFailure)
case structs.VaultChangeModeNoop:
fallthrough
default:

command/agent/consul/check_watcher.go (4 changes: 2 additions & 2 deletions)

@@ -267,7 +267,7 @@ func (w *checkWatcher) Run(ctx context.Context) {
}
}

// Watch a task and restart it if unhealthy.
// Watch a check and restart its task if unhealthy.
func (w *checkWatcher) Watch(allocID, taskName, checkID string, check *structs.ServiceCheck, restarter TaskRestarter) {
if !check.TriggersRestarts() {
// Not watched, noop
@@ -302,7 +302,7 @@ func (w *checkWatcher) Watch(allocID, taskName, checkID string, check *structs.S
}
}

// Unwatch a task.
// Unwatch a check.
func (w *checkWatcher) Unwatch(cid string) {
c := checkWatchUpdate{
checkID: cid,
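
The `Watch`/`Unwatch` pair above is the whole surface the task runner uses: register a check when its `check_restart` settings trigger restarts, and drop it when the task stops. The real watcher in `check_watcher.go` polls Consul; as a rough illustration of the per-check accounting it performs (grace period after task start, consecutive-failure limit, optional warning tolerance, as described in the docs diff below), here is a standalone sketch with made-up names, not the PR's implementation.

```go
package main

import (
	"fmt"
	"time"
)

// checkRestartState is an illustrative stand-in for the per-check bookkeeping
// a watcher needs; it is not the PR's checkWatcher implementation.
type checkRestartState struct {
	limit          int           // consecutive failures before restarting
	grace          time.Duration // ignore results this long after task (re)start
	ignoreWarnings bool          // treat "warning" results as healthy

	taskStart time.Time
	failures  int
}

// apply records one Consul check result and reports whether the watcher
// should ask the task runner for a restart.
func (c *checkRestartState) apply(now time.Time, status string) bool {
	healthy := status == "passing" || (c.ignoreWarnings && status == "warning")
	if healthy {
		c.failures = 0 // a single passing check resets the count
		return false
	}
	if now.Before(c.taskStart.Add(c.grace)) {
		return false // still inside the grace period after a (re)start
	}
	c.failures++
	return c.failures >= c.limit
}

func main() {
	w := &checkRestartState{
		limit:     3,
		grace:     90 * time.Second,
		taskStart: time.Now().Add(-2 * time.Minute), // started long enough ago
	}
	for i, status := range []string{"critical", "critical", "critical"} {
		if w.apply(time.Now(), status) {
			fmt.Printf("restart triggered after %d consecutive failures\n", i+1)
		}
	}
}
```

In the real code the restart itself goes through `RestartTracker.SetRestartTriggered(failure=true)` shown earlier, so repeated health-based restarts remain bounded by the group's `restart` policy.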

command/agent/job_endpoint_test.go (10 changes: 10 additions & 0 deletions)

@@ -1216,6 +1216,11 @@ func TestJobs_ApiJobToStructsJob(t *testing.T) {
Interval: 4 * time.Second,
Timeout: 2 * time.Second,
InitialStatus: "ok",
CheckRestart: &api.CheckRestart{
Limit: 3,
Grace: helper.TimeToPtr(10 * time.Second),
IgnoreWarnings: true,
},
},
},
},
@@ -1406,6 +1411,11 @@ func TestJobs_ApiJobToStructsJob(t *testing.T) {
Interval: 4 * time.Second,
Timeout: 2 * time.Second,
InitialStatus: "ok",
CheckRestart: &structs.CheckRestart{
Limit: 3,
Grace: 10 * time.Second,
IgnoreWarnings: true,
},
},
},
},

nomad/structs/structs.go (7 changes: 4 additions & 3 deletions)

@@ -2780,15 +2780,16 @@ func (c *CheckRestart) Validate() error {
return nil
}

var mErr multierror.Error
if c.Limit < 0 {
return fmt.Errorf("limit must be greater than or equal to 0 but found %d", c.Limit)
mErr.Errors = append(mErr.Errors, fmt.Errorf("limit must be greater than or equal to 0 but found %d", c.Limit))
}

if c.Grace < 0 {
return fmt.Errorf("grace period must be greater than or equal to 0 but found %d", c.Grace)
mErr.Errors = append(mErr.Errors, fmt.Errorf("grace period must be greater than or equal to 0 but found %d", c.Grace))
}

return nil
return mErr.ErrorOrNil()
}

const (

nomad/structs/structs_test.go (18 changes: 18 additions & 0 deletions)

@@ -1154,6 +1154,24 @@ func TestTask_Validate_Service_Check(t *testing.T) {
}
}

func TestTask_Validate_Service_Check_CheckRestart(t *testing.T) {
invalidCheckRestart := &CheckRestart{
Limit: -1,
Grace: -1,
}

err := invalidCheckRestart.Validate()
assert.NotNil(t, err, "invalidCheckRestart.Validate()")
assert.Len(t, err.(*multierror.Error).Errors, 2)

validCheckRestart := &CheckRestart{}
assert.Nil(t, validCheckRestart.Validate())

validCheckRestart.Limit = 1
validCheckRestart.Grace = 1
assert.Nil(t, validCheckRestart.Validate())
}

func TestTask_Validate_LogConfig(t *testing.T) {
task := &Task{
LogConfig: DefaultLogConfig(),

website/source/docs/job-specification/check_restart.html.md (9 changes: 5 additions & 4 deletions)

@@ -30,6 +30,8 @@ unhealthy for the `limit` specified in a `check_restart` stanza, it is
restarted according to the task group's [`restart` policy][restart_stanza]. The
`check_restart` settings apply to [`check`s][check_stanza], but may also be
placed on [`service`s][service_stanza] to apply to all checks on a service.
`check_restart` settings on `service` will only overwrite unset `check_restart`
settings on `check`s.

Review comment (Contributor): If check_restart is set on both the check and service, the stanzas are merged with the check values taking precedence.

```hcl
job "mysql" {
Expand Down Expand Up @@ -66,7 +68,6 @@ job "mysql" {
check_restart {
limit = 3
grace = "90s"

ignore_warnings = false
}
}
@@ -78,7 +79,7 @@ job "mysql" {

- `limit` `(int: 0)` - Restart task when a health check has failed `limit`
times. For example 1 causes a restart on the first failure. The default,
`0`, disables healtcheck based restarts. Failures must be consecutive. A
`0`, disables health check based restarts. Failures must be consecutive. A
single passing check will reset the count, so flapping services may not be
restarted.

@@ -124,8 +125,8 @@ restart {
```

The [`restart` stanza][restart_stanza] controls the restart behavior of the
task. In this case it will wait 10 seconds before restarting. Note that even if
the check passes in this time the restart will still occur.
task. In this case it will stop the task and then wait 10 seconds before
starting it again.

Once the task restarts Nomad waits the `grace` period again before starting to
check the task's health.
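
Putting `grace`, the check `interval`, and `limit` together gives a rough feel for how long an unhealthy task runs before the restart cycle completes. A back-of-the-envelope sketch with hypothetical numbers (real timing also depends on when Consul reports results and on the `restart` stanza's own delay and jitter):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Hypothetical values: grace and limit from check_restart, interval from
	// the check itself, delay from the restart stanza.
	grace := 90 * time.Second        // failures ignored this long after task start
	interval := 10 * time.Second     // how often Consul runs the health check
	limit := 3                       // consecutive failures needed to trigger a restart
	restartDelay := 10 * time.Second // wait before the stopped task is started again

	// Roughly: wait out the grace period, accumulate `limit` consecutive
	// failing checks (one per interval), then wait the restart delay.
	cycle := grace + time.Duration(limit)*interval + restartDelay
	fmt.Printf("roughly %s from an unhealthy start until the task is running again\n", cycle)
}
```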