
Health check the OVS process and restart if it dies #16742

Merged

Conversation

@smarterclayton (Contributor) commented Oct 8, 2017

Reorganize the existing setup code to perform a periodic background check on the state of the OVS database. If the SDN setup is lost, force the node/network processes to restart. Use the JSON-RPC endpoint to perform a few simple status checks and detect failure quickly. This reuses our existing health check code, which does not appear to be a performance issue when run periodically.
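
In rough outline this is a background loop that dials the OVSDB unix socket and runs a setup check on a fixed interval. The following is only a simplified sketch of that idea, not the code in this PR: the real check goes through the OVS JSON-RPC client, and the interval, names, and error handling here are illustrative.

// Simplified sketch of the periodic OVS health check described above (not the
// PR's exact code; the real check speaks JSON-RPC to ovsdb-server, and the
// interval and names here are illustrative).
package healthcheck

import (
	"net"
	"time"

	"github.com/golang/glog"
)

const (
	ovsDialTimeout   = 5 * time.Second
	ovsCheckInterval = 30 * time.Second // illustrative, not the PR's value
)

// runOVSHealthCheck blocks until the OVS database socket is reachable, then
// periodically re-dials it and runs healthFn; any failure is fatal so that
// systemd (or the pod) restarts the node/network processes.
func runOVSHealthCheck(network, addr string, healthFn func() bool) {
	// Wait for OVS to come up before SDN setup proceeds.
	for {
		c, err := net.DialTimeout(network, addr, ovsDialTimeout)
		if err == nil {
			c.Close()
			break
		}
		glog.Infof("waiting for OVS to start: %v", err)
		time.Sleep(time.Second)
	}

	// Deadman switch: if OVS dies or the SDN setup is lost, exit and let the
	// process be restarted.
	go func() {
		for {
			time.Sleep(ovsCheckInterval)
			c, err := net.DialTimeout(network, addr, ovsDialTimeout)
			if err != nil {
				glog.Fatalf("SDN healthcheck detected unhealthy OVS server, restarting: %v", err)
			}
			c.Close()
			if !healthFn() {
				glog.Fatalf("SDN healthcheck detected unhealthy OVS server, restarting: OVS health check failed")
			}
		}
	}()
}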

Node waiting for OVS to start:

I1008 06:41:25.661293   11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:26.690356   11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:27.653112   11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:28.671950   11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:29.653713   11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
W1008 06:41:30.285617   11598 cni.go:189] Unable to update cni config: No networks found in /etc/cni/net.d
E1008 06:41:30.286780   11598 kubelet.go:2093] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
I1008 06:41:30.661441   11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:31.653232   11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:32.674697   11598 sdn_controller.go:180] [SDN setup] full SDN setup required

Let the node start, then stop OVS; the node detects it immediately:

I1008 06:41:40.208239   11598 kubelet_node_status.go:433] Recording NodeReady event message for node localhost.localdomain
I1008 06:41:43.076299   11598 nodecontroller.go:770] NodeController detected that some Nodes are Ready. Exiting master disruption mode.
E1008 06:41:50.941351   11598 healthcheck.go:55] SDN healthcheck disconnected from OVS server: <nil>
I1008 06:41:50.941541   11598 healthcheck.go:60] SDN healthcheck unable to reconnect to OVS server: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:51.045661   11598 healthcheck.go:60] SDN healthcheck unable to reconnect to OVS server: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
F1008 06:41:51.148105   11598 healthcheck.go:76] SDN healthcheck detected unhealthy OVS server, restarting: OVS health check failed

Fixes #16630

@openshift/sig-networking

@openshift-merge-robot openshift-merge-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 8, 2017
@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 8, 2017
@smarterclayton (Contributor, Author):
Similar in spirit to #16740: the node process, rather than systemd, is responsible for checking its dependencies, which means we can more easily tolerate running in pods and reduce the required node configuration.

return fmt.Errorf("detected network plugin mismatch between OpenShift node(%q) and master(%q)", pluginName, clusterNetwork.PluginName)
} else {
// Do not return error in this case
glog.Warningf(`either there is network plugin mismatch between OpenShift node(%q) and master or OpenShift master is running an older version where we did not persist plugin name`, pluginName)
Contributor:
They'd have to be running a 3.2 or earlier master for ClusterNetwork.PluginName to be unset. There's no way we'd support a 3.7 node against a 3.2 master even during an upgrade, right? So we could just drop the inner if here now.

Contributor Author:
Yeah, dead. Will remove.
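
With that branch dropped, the check reduces to something like the following sketch (the surrounding function and the exact comparison in the tree may differ):

// Sketch only: once every supported master persists ClusterNetwork.PluginName,
// a mismatch is simply an error.
if clusterNetwork.PluginName != pluginName {
	return fmt.Errorf("detected network plugin mismatch between OpenShift node(%q) and master(%q)", pluginName, clusterNetwork.PluginName)
}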

@@ -13,9 +13,10 @@ import (
 	"sync"
 	"time"

-	log "github.com/golang/glog"
+	"github.com/golang/glog"
Contributor:
This makes it so that this commit doesn't compile... would be better to put the log->glog commit first and move this change there.

@danwinship (Contributor) left a comment:
Cool. A few comments. (Oops, already submitted some as individual comments)

defer c.Close()

err = c.WaitForDisconnect()
utilruntime.HandleError(fmt.Errorf("SDN healthcheck disconnected from OVS server: %v", err))
Contributor:
I don't know anything about the OVS raw protocol, but if it eventually times out idle connections then this might result in spurious errors in the logs.

Contributor Author:
It's possible to configure OVS to time out idle connections, but at least out of the box on our deployed systems it does not. I also have the 5s disconnect. If in practice we see this error showing up, we could increase the connection timeout significantly and still be OK. It's effectively a deadman switch (and it works really well as one).
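
The logs in the description show the resulting behavior: after WaitForDisconnect returns, the check re-dials a few times before giving up. A rough sketch of that shape, assuming c, network, and addr from the surrounding function and the same net/glog/utilruntime helpers (the retry count and delay are illustrative, not the PR's exact values):

// Sketch of the deadman-switch behavior visible in the logs: block until the
// connection drops, then briefly retry the dial before declaring OVS unhealthy.
err = c.WaitForDisconnect()
utilruntime.HandleError(fmt.Errorf("SDN healthcheck disconnected from OVS server: %v", err))

reconnected := false
for i := 0; i < 3; i++ {
	conn, dialErr := net.DialTimeout(network, addr, 5*time.Second)
	if dialErr != nil {
		glog.Infof("SDN healthcheck unable to reconnect to OVS server: %v", dialErr)
		time.Sleep(100 * time.Millisecond)
		continue
	}
	conn.Close()
	reconnected = true
	break
}
if !reconnected {
	glog.Fatalf("SDN healthcheck detected unhealthy OVS server, restarting: OVS health check failed")
}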

// TODO: make it possible to safely reestablish node configuration after restart
// If OVS goes down and fails the health check, restart the entire process
healthFn := func() bool { return plugin.alreadySetUp(gwCIDR, clusterNetworkCIDRs) }
runOVSHealthCheck("unix", "/var/run/openvswitch/db.sock", healthFn)
Contributor:
Maybe make "unix" and "/var/run/openvswitch/db.sock" be constants in healthcheck.go
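
For illustration, the suggestion amounts to something like this (the constant names are placeholders, not necessarily what was merged):

// In healthcheck.go:
const (
	ovsDialNetwork = "unix"
	ovsDialAddress = "/var/run/openvswitch/db.sock"
)

// ...and at the call site in the node setup code:
healthFn := func() bool { return plugin.alreadySetUp(gwCIDR, clusterNetworkCIDRs) }
runOVSHealthCheck(ovsDialNetwork, ovsDialAddress, healthFn)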

Contributor Author:
k

@smarterclayton (Contributor, Author):
updated

A periodic background process watches for when OVS is reset to the
default state and causes the entire process to restart. This avoids the
need to order the SDN process with OVS, and makes it easier to run the
process in a pod.

In the future it should be possible to avoid restarting the process to
perform this check.
@danwinship (Contributor):
/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 10, 2017
@openshift-merge-robot:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: danwinship, smarterclayton

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@openshift-merge-robot:
Automatic merge from submit-queue (batch tested with PRs 16737, 16638, 16742, 16765, 16711).

@openshift-merge-robot openshift-merge-robot merged commit 16e9703 into openshift:master Oct 10, 2017