Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add health-check-monitor #426

Merged
merged 1 commit into from
May 28, 2020

Conversation

abansal4032
Copy link
Contributor

The PR contains the health-monitor implementation as a NPD monitor.
This is implemented as a new binary : health-checker, which can be enabled using config/container-runtime-health-checker.json and config/kubelet-health-checker.json to check and repair container runtime and kubelet services respectively.

@k8s-ci-robot
Copy link
Contributor

Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA.

It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 11, 2020
@k8s-ci-robot
Copy link
Contributor

Welcome @abansal4032!

It looks like this is your first PR to kubernetes/node-problem-detector 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/node-problem-detector has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot
Copy link
Contributor

Hi @abansal4032. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label May 11, 2020
@abansal4032 abansal4032 force-pushed the health-check-monitor branch from 3ed3c26 to fced857 Compare May 12, 2020 00:05
@abansal4032
Copy link
Contributor Author

/assign @Random-Liu

@abansal4032
Copy link
Contributor Author

/cc @yguo0905

@k8s-ci-robot k8s-ci-robot requested a review from yguo0905 May 12, 2020 00:37
@abansal4032
Copy link
Contributor Author

/recheck-cla

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels May 12, 2020
@wangzhen127
Copy link
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels May 25, 2020
config/container-runtime-health-checker.json Outdated Show resolved Hide resolved
config/container-runtime-health-checker.json Outdated Show resolved Hide resolved
],
"rules": [
{
"type": "temporary",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we only want events? I think we can just use permanent type here. NPD will send events when setting the condition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed this to add a permanent condition here.

pkg/healthchecker/health_checker.go Outdated Show resolved Hide resolved
// Returns true if healthy, false otherwise.
func (hc *healthChecker) CheckHealth() bool {
// Poll till the timeout for the component to be up.
if wait.PollImmediate(types.PollInterval, hc.timeout, func() (bool, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assume kubelet was healthy for a long time. Suddenly, kubelet has some problem. This will wait for 2m (--wait-time) before trying to repair. In health-monitor.sh, it will only wait for 10sec.

It is hard to combine all the use cases for 1) initial wait time, 2) cool down time, and 3) health check timeout. Maybe ok to combine initial wait time and cool down time, but probably need a separate parameter for health check timeout.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Initial wait time : NPD starts only after API server is up and hence the first invocation of the plugin guarantees kubelet and docker to be up.
  2. Cool down time : Changed wait-time to cooldown-time to wait after repair is attempted.
  3. Health check timeout : Added a new flag defaulting to 10s.

@abansal4032 abansal4032 force-pushed the health-check-monitor branch from fced857 to f09501f Compare May 26, 2020 18:04
@wangzhen127
Copy link
Member

/cc @wangzhen127

@k8s-ci-robot k8s-ci-robot requested a review from wangzhen127 May 26, 2020 18:36
if hco.ContainerRuntime != types.DockerRuntime && hco.ContainerRuntime != types.ContainerdRuntime {
panic("The container-runtime specified is not supported. Supported runtimes are : <docker/containerd>")
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also add if it is containerd, the crictl path should be non-empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The path has a default value which is used in case it is empty.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about if someone explicitly use --crictl-path=""? Would that also be using default value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it will use the empty string. I have added a test to detect that. But in general, if that flag is set there is no way of making sure that path exists in the test.

config/health-checker-container-runtime.json Outdated Show resolved Hide resolved
config/health-checker-kubelet.json Show resolved Hide resolved
cmd/healthchecker/options/options.go Outdated Show resolved Hide resolved
config/health-checker-kubelet.json Outdated Show resolved Hide resolved
pkg/healthchecker/health_checker.go Outdated Show resolved Hide resolved
pkg/healthchecker/health_checker.go Outdated Show resolved Hide resolved
import "time"

const (
HttpTimeout = 10 * time.Second
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are consts, not types?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah these are just consts. Since the usage is across packages, this felt like a correct place to have them. Similar pattern is followed here : https://github.com/kubernetes/node-problem-detector/blob/master/pkg/custompluginmonitor/types/types.go#L26

pkg/healthchecker/health_checker.go Outdated Show resolved Hide resolved
@abansal4032 abansal4032 force-pushed the health-check-monitor branch from f09501f to e88e6e0 Compare May 26, 2020 20:13
Copy link
Member

@wangzhen127 wangzhen127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one nit

if hco.ContainerRuntime != types.DockerRuntime && hco.ContainerRuntime != types.ContainerdRuntime {
panic("The container-runtime specified is not supported. Supported runtimes are : <docker/containerd>")
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about if someone explicitly use --crictl-path=""? Would that also be using default value?

@abansal4032 abansal4032 force-pushed the health-check-monitor branch from e88e6e0 to e1e8392 Compare May 27, 2020 00:36
@wangzhen127
Copy link
Member

/lgtm
/hold

Check with @Random-Liu to see if he wants to take a look (I remember he said he would take a look this week?). Feel free to remove hold if he does not need to review.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 27, 2020
@k8s-ci-robot k8s-ci-robot added lgtm "Looks good to me", indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels May 27, 2020
cmd/healthchecker/options/options.go Show resolved Hide resolved
func (hco *HealthCheckerOptions) ValidOrDie() {
// Make sure the component specified is valid.
if hco.Component != types.KubeletComponent && hco.Component != types.ContainerRuntimeComponent {
panic("The component specified is not supported. Supported components are : <docker/container-runtime>")
Copy link
Member

@Random-Liu Random-Liu May 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Supported components don't have kubelet?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was a typo there. Updated the flags.

func (hco *HealthCheckerOptions) ValidOrDie() {
// Make sure the component specified is valid.
if hco.Component != types.KubeletComponent && hco.Component != types.ContainerRuntimeComponent {
panic("The component specified is not supported. Supported components are : <docker/container-runtime>")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we return error instead of panic? So that we can return an Unknown status outside.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Component string
ContainerRuntime string
EnableRepair bool
CriCtlPath string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is the CRI socket path defined?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the flag for CRI socket path.

// HealthCheckerOptions are the options used to configure the health checker.
type HealthCheckerOptions struct {
Component string
ContainerRuntime string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hope we can make the systemd service name configurable, so that this can be used to support cri-o as well.

Probably:

  • Component type: kubelet, docker, cri.
  • SystemdService: for kubelet, docker, default to the component name; for cri this is required if EnableRepair is enabled.

And the repair will just restart the systemd service.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restructured the logic to include the suggested changes.

func (hc *healthChecker) CheckHealth() bool {
// Poll till the health check timeout for the component to be up.
if err := wait.PollImmediate(hc.healthCheckTimeout, hc.healthCheckTimeout, func() (bool, error) {
healthy := hc.healthCheckFunc()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if the healthCheckFunc function itself stuck? We need a timeout to cancel that.

This is very important, because that is the most frequent way the health check would fail.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Included the timeout logic in the healthCheckFunc. Removed the polling in the wrapper function.

glog.Infof("health-checker: component is unhealthy, proceeding to repair")
hc.repairFunc()
// stall for cool down period after repairing
time.Sleep(hc.coolDownTime)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I feel like the status should be reported before the cooldown.

Can we use the systemd service startup time to implement cooldown? If the service hasn't run for 2 min yet, we don't repair it.

For example:
systemctl show docker --property=ActiveEnterTimestamp

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Since we are attempting the repair only if the component has been up for cool down period, the status report will not wait for cool down period.

}
// Use "crictl pods" for containerd health check.
return func() bool {
if err := execCommand(crictlPath, "pods"); err != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the CRI socket path configured? Are we going to assume it is configured on the node?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a flag to specify this.

@abansal4032 abansal4032 force-pushed the health-check-monitor branch from e1e8392 to f5c9425 Compare May 27, 2020 07:24
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 27, 2020
@abansal4032 abansal4032 force-pushed the health-check-monitor branch 3 times, most recently from 6d6b5a5 to f1bb113 Compare May 27, 2020 15:20
Copy link
Member

@Random-Liu Random-Liu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with nits

DefaultHealthCheckTimeout = 10 * time.Second
DefaultCmdTimeout = 10 * time.Second
DefaultCriCtl = "/usr/bin/crictl"
DefaultCricSocketPath = "unix:///var/run/containerd/containerd.sock"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/DefaultCricSocketPath/DefaultCriSocketPath

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// HealthCheckerOptions are the options used to configure the health checker.
type HealthCheckerOptions struct {
Component string
SystemService string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/SystemService/SystemdService

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

func (hco *HealthCheckerOptions) AddFlags(fs *pflag.FlagSet) {
fs.StringVar(&hco.Component, "component", types.KubeletComponent,
"The component to check health for. Supports kubelet, docker and cri")
fs.StringVar(&hco.SystemService, "system-service", "",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/system-service/systemd service.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

"The time to wait for the exec commands to complete.")
}

// ValidOrDie validates health checker command line options.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

case types.KubeletComponent:
return kubeletHealthCheck
case types.DockerComponent:
return func(timeout time.Duration) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Can we make the way we use timeout consistent among the 3 functions? Maybe outside function accept HealthCheckerOptions, internal function doesn't accept arguments?
  2. What is the relationship between HealthCheckTimeout and CmdTimeout? It is a bit confusing. :P

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Done. Made the kubelet function inline.
  2. HealthCheckTimeout is used to timeout actions in health check function. CmdTimeout is the timeout used for all other CLI commands run. For example : getUptime. Kept these two different because we might want to have a longer timeout on health check than other commands.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know what they are, but it is hard for users to understand when see the flag description.

Do we need to make command timeout configurable? Maybe have a constant value for it? I don't think people will need to tweak it. If they do, we can think about how to name that flag then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Made it a constant and removed from the flags.

@abansal4032 abansal4032 force-pushed the health-check-monitor branch from f1bb113 to 5d48436 Compare May 27, 2020 17:25
"k8s.io/node-problem-detector/pkg/healthchecker/types"
)

func TestValidOrDie(t *testing.T) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestIsValid

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

case types.KubeletComponent:
return kubeletHealthCheck
case types.DockerComponent:
return func(timeout time.Duration) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know what they are, but it is hard for users to understand when see the flag description.

Do we need to make command timeout configurable? Maybe have a constant value for it? I don't think people will need to tweak it. If they do, we can think about how to name that flag then.

@abansal4032 abansal4032 force-pushed the health-check-monitor branch from 5d48436 to 44dc4aa Compare May 27, 2020 21:11
@abansal4032
Copy link
Contributor Author

/retest

1 similar comment
@abansal4032
Copy link
Contributor Author

/retest

@Random-Liu
Copy link
Member

/lgtm
/approve

@Random-Liu Random-Liu removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 28, 2020
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 28, 2020
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abansal4032, Random-Liu, wangzhen127

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Random-Liu,wangzhen127]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 452818c into kubernetes:master May 28, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants