Add health-check-monitor #426
Thanks for your pull request. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). 📝 Please follow instructions at https://git.k8s.io/community/CLA.md#the-contributor-license-agreement to sign the CLA. It may take a couple minutes for the CLA signature to be fully registered; after that, please reply here with a new comment and we'll verify. Thanks.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
Welcome @abansal4032!
Hi @abansal4032. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
3ed3c26 to fced857
/assign @Random-Liu
/cc @yguo0905
/recheck-cla
/ok-to-test
  ],
  "rules": [
    {
      "type": "temporary",
Do we only want events? I think we can just use the permanent type here; NPD will send events when it sets the condition.
Changed this to add a permanent condition here.
pkg/healthchecker/health_checker.go
Outdated
// Returns true if healthy, false otherwise.
func (hc *healthChecker) CheckHealth() bool {
	// Poll till the timeout for the component to be up.
	if wait.PollImmediate(types.PollInterval, hc.timeout, func() (bool, error) {
Assume kubelet has been healthy for a long time and then suddenly develops a problem. This will wait for 2m (--wait-time) before trying to repair, whereas health-monitor.sh only waits for 10 seconds.
It is hard to cover all three use cases with one parameter: 1) initial wait time, 2) cool down time, and 3) health check timeout. It may be OK to combine the initial wait time and the cool down time, but the health check timeout probably needs a separate parameter.
- Initial wait time: NPD starts only after the API server is up, so kubelet and docker are guaranteed to be up by the first invocation of the plugin.
- Cool down time: Renamed wait-time to cooldown-time; it is now the time to wait after a repair is attempted.
- Health check timeout: Added a new flag defaulting to 10s.
fced857 to f09501f
/cc @wangzhen127
	if hco.ContainerRuntime != types.DockerRuntime && hco.ContainerRuntime != types.ContainerdRuntime {
		panic("The container-runtime specified is not supported. Supported runtimes are : <docker/containerd>")
	}
}
Also add a check that, if the runtime is containerd, the crictl path is non-empty?
The path has a default value which is used in case it is empty.
What if someone explicitly uses --crictl-path=""? Would that also use the default value?
No, it will use the empty string. I have added a test to detect that. But in general, if that flag is set there is no way of making sure that path exists in the test.
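To make the distinction concrete, here is a minimal sketch of how an omitted flag differs from an explicitly empty one, and how validation could reject the latter. It uses the standard library flag package as a stand-in for pflag, and parseCriCtlPath is a hypothetical helper rather than code from this PR.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// defaultCriCtl mirrors the default mentioned in the review; treat it as illustrative.
const defaultCriCtl = "/usr/bin/crictl"

// parseCriCtlPath (hypothetical) parses args and returns the crictl path.
// An omitted flag falls back to the default, while an explicit empty
// value overrides the default and is rejected by the check below.
func parseCriCtlPath(args []string) (string, error) {
	fs := flag.NewFlagSet("health-checker", flag.ContinueOnError)
	var path string
	fs.StringVar(&path, "crictl-path", defaultCriCtl, "The path to the crictl binary.")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	if path == "" {
		return "", errors.New("crictl-path must not be empty")
	}
	return path, nil
}

func main() {
	// Flag omitted: the default is used.
	fmt.Println(parseCriCtlPath(nil))
	// Flag explicitly set to the empty string: validation fails.
	fmt.Println(parseCriCtlPath([]string{"--crictl-path="}))
}
```

As noted above, a check like this can only confirm the value is non-empty; it does not verify that the binary exists at that path.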
pkg/healthchecker/types/types.go
Outdated
import "time"

const (
	HttpTimeout = 10 * time.Second
Those are consts, not types?
Yeah, these are just consts. Since they are used across packages, this felt like the right place for them. A similar pattern is followed here: https://github.com/kubernetes/node-problem-detector/blob/master/pkg/custompluginmonitor/types/types.go#L26
f09501f to e88e6e0
Just one nit
e88e6e0 to e1e8392
/lgtm Check with @Random-Liu to see if he wants to take a look (I remember he said he would take a look this week?). Feel free to remove hold if he does not need to review.
cmd/healthchecker/options/options.go
Outdated
func (hco *HealthCheckerOptions) ValidOrDie() {
	// Make sure the component specified is valid.
	if hco.Component != types.KubeletComponent && hco.Component != types.ContainerRuntimeComponent {
		panic("The component specified is not supported. Supported components are : <docker/container-runtime>")
The list of supported components doesn't include kubelet?
Was a typo there. Updated the flags.
cmd/healthchecker/options/options.go
Outdated
func (hco *HealthCheckerOptions) ValidOrDie() {
	// Make sure the component specified is valid.
	if hco.Component != types.KubeletComponent && hco.Component != types.ContainerRuntimeComponent {
		panic("The component specified is not supported. Supported components are : <docker/container-runtime>")
Should we return error instead of panic? So that we can return an Unknown status outside.
Done.
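For readers following along, a minimal sketch of what validation that returns an error (rather than panicking) might look like. The struct, the function names, and the exit-code mapping are all hypothetical; in particular, treating any exit code other than 0 or 1 as Unknown is an assumption about the custom plugin protocol, and the component names are taken from the flag description later in this thread.

```go
package main

import "fmt"

const (
	kubeletComponent = "kubelet"
	dockerComponent  = "docker"
	criComponent     = "cri"

	// Assumed plugin exit codes: 0 = OK, 2 = Unknown.
	exitOK      = 0
	exitUnknown = 2
)

// healthCheckerOptions is a trimmed-down, hypothetical options struct.
type healthCheckerOptions struct {
	Component string
}

// isValid returns a descriptive error instead of panicking on bad input.
func (o *healthCheckerOptions) isValid() error {
	switch o.Component {
	case kubeletComponent, dockerComponent, criComponent:
		return nil
	default:
		return fmt.Errorf("component %q is not supported; supported components are: kubelet, docker, cri", o.Component)
	}
}

// exitCodeFor maps a validation result to the exit code the caller would report.
func exitCodeFor(err error) int {
	if err != nil {
		return exitUnknown
	}
	return exitOK
}

func main() {
	opts := &healthCheckerOptions{Component: "containerd"}
	err := opts.isValid()
	fmt.Println("validation error:", err)
	fmt.Println("exit code:", exitCodeFor(err))
}
```

Returning the error lets the caller decide how to report it, which is what makes an Unknown status possible in the first place.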
	Component        string
	ContainerRuntime string
	EnableRepair     bool
	CriCtlPath       string
How is the CRI socket path defined?
Added the flag for CRI socket path.
cmd/healthchecker/options/options.go
Outdated
// HealthCheckerOptions are the options used to configure the health checker.
type HealthCheckerOptions struct {
	Component        string
	ContainerRuntime string
I hope we can make the systemd service name configurable, so that this can be used to support cri-o as well.
Probably:
- Component type: kubelet, docker, cri.
- SystemdService: for kubelet and docker, default to the component name; for cri, this is required if EnableRepair is enabled.
And the repair will just restart the systemd service.
Restructured the logic to include the suggested changes.
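The defaulting rule suggested above can be sketched roughly as follows. This is an illustration under stated assumptions, not the restructured code from the PR: the options struct and both helpers are hypothetical, and the repair is shown as a plain systemctl restart because that is what the suggestion describes.

```go
package main

import (
	"errors"
	"fmt"
)

// options is a hypothetical, trimmed-down version of the health checker options.
type options struct {
	Component      string
	SystemdService string
	EnableRepair   bool
}

// resolveSystemdService applies the suggested defaults: kubelet and docker
// fall back to the component name, while cri needs an explicit service
// name when repair is enabled.
func (o *options) resolveSystemdService() error {
	if o.SystemdService != "" {
		return nil
	}
	switch o.Component {
	case "kubelet", "docker":
		o.SystemdService = o.Component
	case "cri":
		if o.EnableRepair {
			return errors.New("systemd service must be specified for the cri component when repair is enabled")
		}
	default:
		return fmt.Errorf("unsupported component %q", o.Component)
	}
	return nil
}

// repairCommand returns the command line that would restart the service.
func repairCommand(service string) []string {
	return []string{"systemctl", "restart", service}
}

func main() {
	o := &options{Component: "cri", SystemdService: "crio", EnableRepair: true}
	if err := o.resolveSystemdService(); err != nil {
		fmt.Println("invalid options:", err)
		return
	}
	fmt.Println(repairCommand(o.SystemdService))
}
```

With this shape, supporting cri-o only requires passing a different service name; no new component type is needed.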
pkg/healthchecker/health_checker.go
Outdated
func (hc *healthChecker) CheckHealth() bool {
	// Poll till the health check timeout for the component to be up.
	if err := wait.PollImmediate(hc.healthCheckTimeout, hc.healthCheckTimeout, func() (bool, error) {
		healthy := hc.healthCheckFunc()
What if the healthCheckFunc function itself gets stuck? We need a timeout to cancel that.
This is very important, because that is the most frequent way the health check would fail.
Included the timeout logic in the healthCheckFunc. Removed the polling in the wrapper function.
pkg/healthchecker/health_checker.go
Outdated
	glog.Infof("health-checker: component is unhealthy, proceeding to repair")
	hc.repairFunc()
	// stall for cool down period after repairing
	time.Sleep(hc.coolDownTime)
Hm, I feel like the status should be reported before the cooldown.
Can we use the systemd service startup time to implement cooldown? If the service hasn't run for 2 min yet, we don't repair it.
For example:
systemctl show docker --property=ActiveEnterTimestamp
Done. Since we now attempt the repair only if the component has been up for the cool down period, the status report no longer waits for the cool down period.
pkg/healthchecker/health_checker.go
Outdated
	}
	// Use "crictl pods" for containerd health check.
	return func() bool {
		if err := execCommand(crictlPath, "pods"); err != nil {
Where is the CRI socket path configured? Are we going to assume it is configured on the node?
Added a flag to specify this.
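As an illustration of how a configured socket path could reach crictl, the sketch below builds the command with crictl's --runtime-endpoint option instead of relying on node-level configuration. criHealthCheckCmd is a hypothetical helper, and the paths shown are example values, not necessarily what the PR uses.

```go
package main

import (
	"fmt"
	"os/exec"
)

// criHealthCheckCmd builds (but does not run) a "crictl pods" invocation
// that targets an explicit CRI socket.
func criHealthCheckCmd(crictlPath, criSocketPath string) *exec.Cmd {
	return exec.Command(crictlPath, "--runtime-endpoint="+criSocketPath, "pods")
}

func main() {
	cmd := criHealthCheckCmd("/usr/bin/crictl", "unix:///var/run/containerd/containerd.sock")
	fmt.Println(cmd.Args)
}
```

Passing the endpoint explicitly keeps the check independent of whatever crictl configuration file happens to exist on the node.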
e1e8392 to f5c9425
6d6b5a5 to f1bb113
LGTM with nits
pkg/healthchecker/types/types.go
Outdated
	DefaultHealthCheckTimeout = 10 * time.Second
	DefaultCmdTimeout         = 10 * time.Second
	DefaultCriCtl             = "/usr/bin/crictl"
	DefaultCricSocketPath     = "unix:///var/run/containerd/containerd.sock"
s/DefaultCricSocketPath/DefaultCriSocketPath
Done.
cmd/healthchecker/options/options.go
Outdated
// HealthCheckerOptions are the options used to configure the health checker.
type HealthCheckerOptions struct {
	Component     string
	SystemService string
s/SystemService/SystemdService
Done.
cmd/healthchecker/options/options.go
Outdated
func (hco *HealthCheckerOptions) AddFlags(fs *pflag.FlagSet) {
	fs.StringVar(&hco.Component, "component", types.KubeletComponent,
		"The component to check health for. Supports kubelet, docker and cri")
	fs.StringVar(&hco.SystemService, "system-service", "",
s/system-service/systemd-service
Done.
cmd/healthchecker/options/options.go
Outdated
"The time to wait for the exec commands to complete.") | ||
}

// ValidOrDie validates health checker command line options.
Update the comment.
Done.
pkg/healthchecker/health_checker.go
Outdated
	case types.KubeletComponent:
		return kubeletHealthCheck
	case types.DockerComponent:
		return func(timeout time.Duration) bool {
- Can we make the way we use timeout consistent among the 3 functions? Maybe the outer function accepts HealthCheckerOptions and the internal function doesn't accept arguments?
- What is the relationship between HealthCheckTimeout and CmdTimeout? It is a bit confusing. :P
- Done. Made the kubelet function inline.
- HealthCheckTimeout is used to time out actions in the health check function. CmdTimeout is the timeout used for all other CLI commands that are run, for example getUptime. I kept the two separate because we might want a longer timeout on the health check than on other commands.
I know what they are, but it is hard for users to understand the difference when they see the flag descriptions.
Do we need to make the command timeout configurable? Maybe have a constant value for it? I don't think people will need to tweak it. If they do, we can think about how to name that flag then.
Sounds good. Made it a constant and removed from the flags.
f1bb113 to 5d48436
"k8s.io/node-problem-detector/pkg/healthchecker/types" | ||
)

func TestValidOrDie(t *testing.T) {
TestIsValid
Done.
5d48436 to 44dc4aa
/retest
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: abansal4032, Random-Liu, wangzhen127 The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
The PR contains the health-monitor implementation as an NPD monitor.
This is implemented as a new binary, health-checker, which can be enabled using config/container-runtime-health-checker.json and config/kubelet-health-checker.json to check and repair the container runtime and kubelet services respectively.