
health: detect failing tasks #7383

Merged
merged 4 commits into master from b-health-detect-failing-tasks on Mar 25, 2020
Conversation

@notnoop (Contributor) commented Mar 18, 2020:

An alternative to #7366 that is friendlier to task dependencies.

Fixes a bug where an allocation is considered healthy if some of the
tasks are being restarted and, as such, their checks aren't tracked by
the Consul agent client.

Here, we fix the immediate case by ensuring that an alloc is healthy
only if its tasks are running and the checks registered at that time are
healthy.

Previously, the health tracker tracked task "health" independently of
checks, which led to problems when a task restarts. Consider the
following series of events:

  1. all tasks start running -> tracker.tasksHealthy is true
  2. one task has unhealthy checks and gets restarted
  3. the remaining checks are healthy -> tracker.checksHealthy is true
  4. health status is propagated now that tracker.tasksHealthy and
     tracker.checksHealthy are both true, even though one task is
     still restarting

This change ensures that we accurately use the latest status of tasks
and checks, regardless of how those statuses change.

It also ensures that we only consider check health after tasks are
considered healthy; otherwise we risk trusting incomplete checks.
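
As a minimal sketch of this dependent-state logic (the `Tracker` shape and names here are simplified illustrations, not the actual Nomad implementation), the check-health setter folds task health in and only propagates overall health when both conditions hold at once:

```go
package health

import "sync"

// Tracker is a simplified stand-in for the alloc health tracker.
type Tracker struct {
	lock          sync.Mutex
	tasksHealthy  bool
	checksHealthy bool
	healthy       chan bool
}

// setCheckHealth records the latest check health, folding task health
// in so that checks from a restarting task are never trusted.
func (t *Tracker) setCheckHealth(healthy bool) {
	t.lock.Lock()
	defer t.lock.Unlock()

	// check health should always be false if tasks are unhealthy,
	// as checks might be missing from unhealthy tasks
	t.checksHealthy = healthy && t.tasksHealthy

	// propagate alloc health only when both conditions hold at the
	// same time, closing the gap in the event sequence above
	if t.tasksHealthy && t.checksHealthy {
		select {
		case t.healthy <- true:
		default:
		}
	}
}
```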

This approach accommodates task dependencies well. Service jobs can have
short-lived prestart tasks that terminate before the main process runs.
These dead tasks that complete successfully will not negate health
status.

I also included some fixes here targeting task dependencies (fyi @jazzyfresh), so that successfully completed non-sidecar lifecycle tasks don't impact allocation health.
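
A sketch of how that classification can work (the function name is illustrative; it assumes the nomad/structs types, matching the diff snippet quoted further down):

```go
package health

import "github.com/hashicorp/nomad/nomad/structs"

// collectLifecycleTasks returns the names of non-sidecar lifecycle
// tasks; these may terminate successfully without failing the alloc,
// while sidecars are treated like any other long-running task.
func collectLifecycleTasks(tg *structs.TaskGroup) map[string]bool {
	lifecycleTasks := map[string]bool{}
	for _, task := range tg.Tasks {
		if task.Lifecycle != nil && !task.Lifecycle.Sidecar {
			lifecycleTasks[task.Name] = true
		}
	}
	return lifecycleTasks
}
```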

Fixes #7320
Closes #7375

@notnoop self-assigned this Mar 18, 2020
@notnoop added this to Triaged in Nomad - Community Issues Triage via automation Mar 18, 2020

// check health should always be false if tasks are unhealthy
// as checks might be missing from unhealthy tasks
t.checksHealthy = healthy && t.tasksHealthy
@notnoop (Contributor, Author):

This calls out for having a state machine to represent these dependent states. Would appreciate suggestions, especially ones that don't require a significant rewrite.

@schmichael (Member):

Your logic seems sound even if the structure of this package is sub-optimal. 👍 to your approach.

Mahmood Ali added 2 commits March 22, 2020 11:13:

- Add tests to check for failing or missing service checks in consul update
- Fix a bug where an allocation is considered healthy while some of its tasks are being restarted
@notnoop force-pushed the b-health-detect-failing-tasks branch from 2cbca7b to 314f345 on March 22, 2020 16:32
In service jobs, non-sidecar lifecycle tasks tweak the health logic a
bit: they may terminate successfully without impacting alloc health, but
they fail the alloc if they fail.

Sidecars should be treated just like normal tasks.
@langmartin (Contributor) left a comment:

The code looks good, just the one question about task state progression. I don't have a lot of context in this code; it would be good to get a second review.

t.setTaskHealth(false, true)
return
}

if state.State != structs.TaskStateRunning {
if state.State == structs.TaskStatePending {
@langmartin (Contributor):

Probably missing context: do tasks controlled by a restart block always go back through the pending state on their way to running?

@notnoop (Contributor, Author):

Yes. When a task fails, it moves to dead once it's beyond its restart policy attempts; otherwise, it moves to pending until it's scheduled to run again after the restart policy delay.
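
In other words, something like the following hedged sketch (the helper name and exact handling are illustrative, not the PR's code):

```go
package health

import "github.com/hashicorp/nomad/nomad/structs"

// taskStateHealthy reports whether a task's state counts as healthy
// once the healthy deadline is reached.
func taskStateHealthy(state *structs.TaskState, nonSidecarLifecycle bool) bool {
	switch state.State {
	case structs.TaskStateRunning:
		return true
	case structs.TaskStateDead:
		// only non-sidecar lifecycle tasks may be dead yet healthy,
		// and only if they terminated successfully
		return nonSidecarLifecycle && !state.Failed
	default:
		// pending: a restarted task re-enters pending while it waits
		// out the restart delay, so it is not healthy yet
		return false
	}
}
```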

@notnoop merged commit 4a27cdd into master Mar 25, 2020
Nomad - Community Issues Triage automation moved this from Triaged to Done Mar 25, 2020
@notnoop deleted the b-health-detect-failing-tasks branch March 25, 2020 10:30

@schmichael (Member) left a comment:

Looks good! My comments are just suggestions if you circle back to this code to write further tests or something. Not critical.


for _, task := range t.tg.Tasks {
if task.Lifecycle != nil && !task.Lifecycle.Sidecar {
@schmichael (Member):

We may want a helper like Connect has for this, as it's easy for people to forget the initial nil check and cause a panic: https://github.com/hashicorp/nomad/blob/v0.10.5/nomad/structs/services.go#L585
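
Something like this hypothetical helper on structs.Task (not in the codebase at the time; the name is illustrative):

```go
// IsNonSidecarLifecycle folds the nil check in so callers can't forget
// it and panic on tasks without a lifecycle block.
func (t *Task) IsNonSidecarLifecycle() bool {
	return t.Lifecycle != nil && !t.Lifecycle.Sidecar
}
```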


Comment on lines +69 to +71
// lifecycleTasks is a set of tasks with lifecycle hook set and may
// terminate without affecting alloc health
lifecycleTasks map[string]bool
@schmichael (Member):

Let's document that sidecars are explicitly excluded from this (code suggestions are disabled, sorry).
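
For example, the field comment could read (a sketch of the suggested wording, not a committed change):

```go
// lifecycleTasks is the set of tasks that have a lifecycle hook and
// are not sidecars; they may terminate without affecting alloc health.
// Sidecar tasks are explicitly excluded and treated like normal tasks.
lifecycleTasks map[string]bool
```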

case structs.TaskStatePending:
return "Task not running by deadline", true
case structs.TaskStateDead:
// hook tasks are healthy when dead successfully
@schmichael (Member):
Could maybe get a bit more descriptive here? Just an idea:

// non-sidecar lifecycle tasks are expected to terminate and therefore healthy when dead
