openshift-sdn does not tolerate being restarted #16630
@openshift/sig-networking
Looks like when running the full node (not containerized), it also doesn't get reconnected on restart. What am I missing?
`I1001 17:08:44.520570 70577 sdn_controller.go:174] [SDN setup] full SDN setup required`

It shouldn't be saying that if nothing (config or code) changed. It should just pick up the existing setup. So that's one problem. (The code to check if things are already set up must not be working in this environment?) It looks like it's recreating it correctly though. In particular, it runs through a `Processing pod network request &{UPDATE ...` for each pod and creates correct-looking OVS flows for them...
I'll dig in and recreate, may ask you guys to help interpret.
Added debugging:

From this line in `alreadySetUp()`:

Master config for networking:

and kube:
Thanks for the fix. One question as well: should we have a babysitter that can detect when we are missing large numbers of flows and re-trigger setup? Crashloop, perhaps? I.e., if I shoot OVS, how long before the SDN controller detects that and fixes it? Having a periodic check that refreshes within a window would be ideal; the faster we can detect a failure from OVS, the better. Do we (or can we) heartbeat OVS from the SDN controller and detect these sorts of disruptions?
Automatic merge from submit-queue.

Fix route checking in alreadySetUp

We want to check that each cluster network has a corresponding route, not that each route has a corresponding cluster network.

Fixes #16630
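The inverted check matters: the old direction only verified that each existing route was known, so a cluster network whose route had been deleted went unnoticed. A minimal sketch of the corrected direction (identifier names here are illustrative, not the actual origin code):

```go
// routesExistForAllClusterNetworks verifies containment in the direction
// the fix describes: every cluster network must have a matching route.
// Real code would compare parsed *net.IPNet values, not strings.
func routesExistForAllClusterNetworks(clusterNetworks, routes []string) bool {
	for _, cn := range clusterNetworks {
		found := false
		for _, r := range routes {
			if r == cn {
				found = true
				break
			}
		}
		if !found {
			// A cluster network is missing its route: full SDN setup required.
			return false
		}
	}
	return true
}
```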
Holding open to close out whether we need to do more (if we're going to be running in a pod setup) to be resilient to OVS restart.
In OCP, we set up the systemd unit files so that if systemd restarts OVS (e.g. due to a crash or an upgrade), it will restart OpenShift too, so we recover. But if you just `ip link del br0`, things will stay broken until you restart OpenShift yourself. But you know, "don't do that then"?
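For context, a sketch of the kind of unit wiring being described. The directives are standard systemd, but the exact unit/drop-in contents used by OCP are an assumption here:

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/origin-node.service.d/ovs.conf
[Unit]
# PartOf= propagates stop/restart of openvswitch.service to this unit,
# so a restarted OVS takes the node process with it and setup re-runs.
PartOf=openvswitch.service
After=openvswitch.service
```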
We won't have this when we run in pods, which is why I'm asking. Is presence of br0 sufficient to detect a restarted OVS?
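For what it's worth, a presence check by itself is easy to sketch. This is purely hypothetical and uses the third-party `github.com/vishvananda/netlink` package; the eventual fix instead polls the OVSDB socket, since br0 presence alone may not capture lost flows:

```go
package main

import (
	"fmt"

	"github.com/vishvananda/netlink" // assumed dependency
)

// br0Exists reports whether the br0 device is present on the host.
// Absence implies someone deleted the bridge or OVS lost its state.
func br0Exists() bool {
	_, err := netlink.LinkByName("br0")
	return err == nil
}

func main() {
	fmt.Println("br0 present:", br0Exists())
}
```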
Opened a fix that health-checks OVS and exits the process if OVS is detected as reset. I also want to look at what events we should send here.
Added an event for when the pod is restarted.
Automatic merge from submit-queue (batch tested with PRs 16737, 16638, 16742, 16765, 16711).

Health check the OVS process and restart if it dies

Reorganize the existing setup code to perform a periodic background check on the state of the OVS database. If the SDN setup is lost, force the node/network processes to restart. Use the JSONRPC endpoint to perform a few simple checks of status, and detect failure quickly. This reuses our existing health check code, which does not appear to be a performance issue when checked periodically.

Node waiting for OVS to start:

```
I1008 06:41:25.661293 11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:26.690356 11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:27.653112 11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:28.671950 11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:29.653713 11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
W1008 06:41:30.285617 11598 cni.go:189] Unable to update cni config: No networks found in /etc/cni/net.d
E1008 06:41:30.286780 11598 kubelet.go:2093] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
I1008 06:41:30.661441 11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:31.653232 11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:32.674697 11598 sdn_controller.go:180] [SDN setup] full SDN setup required
```

Let node start, then stop OVS; node detects immediately:

```
I1008 06:41:40.208239 11598 kubelet_node_status.go:433] Recording NodeReady event message for node localhost.localdomain
I1008 06:41:43.076299 11598 nodecontroller.go:770] NodeController detected that some Nodes are Ready. Exiting master disruption mode.
E1008 06:41:50.941351 11598 healthcheck.go:55] SDN healthcheck disconnected from OVS server: <nil>
I1008 06:41:50.941541 11598 healthcheck.go:60] SDN healthcheck unable to reconnect to OVS server: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:51.045661 11598 healthcheck.go:60] SDN healthcheck unable to reconnect to OVS server: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
F1008 06:41:51.148105 11598 healthcheck.go:76] SDN healthcheck detected unhealthy OVS server, restarting: OVS health check failed
```

Fixes #16630

@openshift/sig-networking
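For illustration, a minimal sketch of the dial-and-crash pattern those logs describe. The socket path comes from the logs above, but the sketch is an assumption otherwise: it uses a plain unix-socket dial and the standard library `log` package, whereas the actual PR speaks JSONRPC to the OVSDB endpoint and reuses origin's existing health check code.

```go
package main

import (
	"log"
	"net"
	"time"
)

// Socket path taken from the log lines above.
const ovsdbSocket = "/var/run/openvswitch/db.sock"

// checkOVS returns nil if the OVSDB server accepts a connection.
// (The real fix goes further and issues JSONRPC status queries.)
func checkOVS() error {
	conn, err := net.DialTimeout("unix", ovsdbSocket, 2*time.Second)
	if err != nil {
		return err
	}
	return conn.Close()
}

func main() {
	// Startup: block until OVS is up, mirroring "waiting for OVS to start".
	for checkOVS() != nil {
		log.Print("waiting for OVS to start")
		time.Sleep(time.Second)
	}

	// Background health check: retry briefly on failure, then exit fatally
	// so systemd or the pod restart policy brings the node process back up
	// and it re-runs full SDN setup.
	for range time.Tick(5 * time.Second) {
		var err error
		for i := 0; i < 3; i++ {
			if err = checkOVS(); err == nil {
				break
			}
			log.Printf("SDN healthcheck unable to reconnect to OVS server: %v", err)
			time.Sleep(100 * time.Millisecond)
		}
		if err != nil {
			log.Fatalf("SDN healthcheck detected unhealthy OVS server, restarting: %v", err)
		}
	}
}
```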
When working with the bootstrap code (#16571) I'm seeing that restarts of the networking process (sdn, proxy, dns) result in multi-tenant SDN connectivity being lost when the new process comes up. The pods remain reachable while the old process is terminating, but once the new process starts, the existing pods have no connectivity.
Scenario:
oc run --restart=Never --image centos:7 debug -- /bin/bash -c '(sleep 10000)'
oc run --restart=Never --image gcr.io/google-containers/test-webserver imagetest
oc exec debug -- curl $(oc get pod imagetest -o jsonpath={.status.podIP}), able to see contents.

It looks like within that existing pod all networking is lost: to the host for DNS, to the service network, etc.
Dump from within the OVS pod: