
health: detect missing task checks #7366

Closed · wants to merge 3 commits

Conversation

notnoop
Contributor

@notnoop notnoop commented Mar 17, 2020

Fixes a bug where an allocation is considered healthy if some of the
tasks are being restarted and as such, their checks aren't tracked by
consul agent client.

The underlying problem is that allocation registration in consul
agent/client code is mutable: tasks get removed as services from consul,
prior to stopping/restarting to allow for graceful removal from LBs.
The downside is that the health tracker may consider the allocation as
healthy if one of the task is down.

This uses the simplest approach to patch the problem by detecting the
number of expected checks against the registered checks.

I don't anticipate a discrepancy between the counters. `sreg.Checks` should only contain checks that the Nomad agent explicitly registered, filtering out unexpected or unrelated checks (see https://github.com/hashicorp/nomad/blob/0ecda992317d3300e1c1da05170f8bba18410357/command/agent/consul/client.go#L1138-L1147):

for _, treg := range reg.Tasks {
	for serviceID, sreg := range treg.Services {
		sreg.Service = services[serviceID]
		for checkID := range sreg.checkIDs {
			if check, ok := checks[checkID]; ok {
				sreg.Checks = append(sreg.Checks, check)
			}
		}
	}
}


A better approach would have been to strictly compare the found check IDs against an immutable list of expected IDs. Sadly, that requires significant code changes to both the task runner service hooks and the consul hooks, which I'm not comfortable making so close to cutting a new release.

Fixes #7320

Mahmood Ali added 2 commits March 17, 2020 11:43
Add tests to check for failing or missing service checks in consul update.
Fixes a bug where an allocation is considered healthy if some of the tasks are being restarted.
@notnoop notnoop requested a review from schmichael March 17, 2020 16:24
@notnoop notnoop self-assigned this Mar 17, 2020
@notnoop notnoop added this to Triaged in Nomad - Community Issues Triage via automation Mar 17, 2020
Comment on lines 376 to 385
select {
case <-time.After(5 * time.Second):
	// great no healthy status
case health := <-hs.healthCh:
	require.False(health.healthy)

	// Unhealthy allocs shouldn't emit task events
	ev := health.taskEvents[task.Name]
	require.NotNilf(ev, "%#v", health.taskEvents)
}
Contributor Author

We seem to lack tests for health check logic for failure and it's not obvious to me how to effectively test this quickly without changing many interfaces :(. Waiting 5 seconds is pretty long. Suggestions?

Member

Setting HealthyDeadline=2 * time.Second might be safe and tighten things up a little bit. Not sure how else to test this without extensive changes. t.Parallel should help us out a lot here, but 5s minimum would be unfortunate.

Member

Also testing in the allochealth package may allow peeking at internals in a time-independent way? I'm not sure after a quick peek, but the deadline is only taken into account in this package.

Contributor Author

I added a test in allochealth that's not so time sensitive and lowered the tests here to 2s.

Member
@schmichael schmichael left a comment

Seems like we still need to take group checks into account!

//TODO(schmichael) for now skip unknown tasks as
//they're task group services which don't currently
//support checks anyway
if v, ok := t.taskHealth[task]; ok {
	v.state = state
}

Out of scope for this PR so I opened this issue: #7375

Comment on lines 382 to 384
// Unhealthy allocs shouldn't emit task events
ev := health.taskEvents[task.Name]
require.NotNilf(ev, "%#v", health.taskEvents)
Member

The comment says "unhealthy alloc shouldn't emit task events" yet the assertion is non-nil.

I'm also a little confused why both cases (timeout and health=false) seem valid? Is this test non-deterministic?

Contributor Author

Sorry - this case never actually gets exercised - it's dead code; I was trying to get some error messages. Will remove it.


if foundChecks < t.consulCheckCount {
	// didn't see all the checks we expect, task might be slow to start
	// or restarting tasks that had their service hooks deregistered
	passed = false
}
Contributor

Will this logic work when it is a migration for a batch job (minHealthyTime = strategy.MinHealthyTime)?

Contributor Author

I believe so - this depends on the assumption that checks are unique and that the job definition has a 1-to-1 mapping with consul agent check registrations, regardless of exact parameters or minHealthyTime. This needs to be tweaked for task group level checks, but I will address that in a follow-up PR.

Contributor

Sorry, I should have explained my thinking more. If it is a batch job, the tasks can finish, and then the number of checks registered in Consul will be less than the number defined in the job spec.

Contributor Author

Ah - excellent point. I'm going back to the drawing board to think this through a bit more. I'm not sure I fully understand the expectations for service health checks in batch jobs, but it's definitely a problem for service jobs with init containers using the task lifecycle/dependencies features. Thanks for raising it.

Contributor Author
@notnoop notnoop Mar 18, 2020

Checked - and it looks like batch jobs don't support migrate or update/deployments [1]. Do you know when this code is relevant for batch, by any chance?

[1] https://github.com/hashicorp/nomad/blob/v0.10.4/nomad/structs/structs.go#L4997-L5021

Contributor

Ah sorry for the noise! Thought batch supported migrate!

Contributor Author

:) You helped find a bug though - addressed it with a different approach in #7383.

@notnoop
Contributor Author

notnoop commented Mar 23, 2020

Closing this one as it's superseded by #7383.

@notnoop notnoop closed this Mar 23, 2020
Nomad - Community Issues Triage automation moved this from Triaged to Done Mar 23, 2020
@github-actions

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 14, 2023
Development

Successfully merging this pull request may close these issues.

Nomad incorrectly marking unhealthy allocs as healthy during rolling upgrade