Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consul: correctly interpret missing consul checks as unhealthy #15822

Merged
merged 4 commits into from
Jan 19, 2023

Conversation

shoenig
Copy link
Member

@shoenig shoenig commented Jan 19, 2023

This PR fixes a bug where Nomad assumed any registered Checks would exist
in the service registration coming back from Consul. In some cases, the
Consul may be slow in processing the check registration, and the response
object would not contain checks. Nomad would then scan the empty response
looking for Checks with failing health status, finding none, and then
marking a task/alloc as healthy.

In reality, we must always use Nomad's view of what checks should exist as
the source of truth, and compare that with the response Consul gives us,
making sure they match, before scanning the Consul response for failing
check statuses.

Fixes #15536

This PR fixes a bug where Nomad assumed any registered Checks would exist
in the service registration coming back from Consul. In some cases, the
Consul may be slow in processing the check registration, and the response
object would not contain checks. Nomad would then scan the empty response
looking for Checks with failing health status, finding none, and then
marking a task/alloc as healthy.

In reality, we must always use Nomad's view of what checks should exist as
the source of truth, and compare that with the response Consul gives us,
making sure they match, before scanning the Consul response for failing
check statuses.

Fixes #15536
@shoenig shoenig changed the title [no ci] consul: correctly understand missing consul checks as unhealthy consul: correctly understand missing consul checks as unhealthy Jan 19, 2023
@shoenig shoenig changed the title consul: correctly understand missing consul checks as unhealthy consul: correctly interpret missing consul checks as unhealthy Jan 19, 2023
@shoenig shoenig added this to the 1.4.x milestone Jan 19, 2023
@shoenig shoenig added the backport/1.4.x backport to 1.4.x release line label Jan 19, 2023
@shoenig shoenig marked this pull request as ready for review January 19, 2023 15:44
Copy link
Contributor

@lgfa29 lgfa29 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Minor coding nit-picking, but not a blocker 👍

Comment on lines 531 to 554
expChecks := set.New[string](10)
expCount := 0
regChecks := set.New[string](10)
regCount := 0
for _, service := range services {
for _, check := range service.Checks {
expChecks.Insert(check.Name)
expCount++
}
}
for _, task := range registrations.Tasks {
for _, service := range task.Services {
for _, check := range service.Checks {
regChecks.Insert(check.Name)
regCount++
}
}
}
if expCount != regCount {
return false
}
if !expChecks.Equal(regChecks) {
return false
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit-picking, but do we need a set and counter here? I think using a map[service name]: count would be enough and maybe easier to read?

}
// scan for missing or unhealthy consul checks
if !evaluateConsulChecks(t.tg, allocReg) {
passed = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it meaningful that t.setCheckHealth(false) is no longer called alongside this bool being set? I see that it is called elsewhere, but haven't chased down the potential implications of its removal here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah good catch, and I believe this is important in the edge case where

  1. checks become healthy
  2. tasks become healthy
  3. start minimum healthy time
  4. checks become unhealthy
  5. minimum healthy time ends <- we never updated check status to unhealthy, so we'll incorrectly report they are healthy

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this back in, and also added TestTracker_ConsulChecks_HealthToUnhealthy to cover the case where health checks transition from healthy to unhealthy during the minimum health period.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.4.x backport to 1.4.x release line
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Nomad marks deployments as successful, before Consul checks are passed.
3 participants