
openshift-sdn does not tolerate being restarted #16630

Closed

smarterclayton opened this issue Oct 1, 2017 · 13 comments
Labels: component/networking, kind/bug, priority/P2, sig/networking

smarterclayton commented Oct 1, 2017

When working with the bootstrap code (#16571) I'm seeing that restarts of the networking process (sdn, proxy, dns) result in multi-tenant SDN connectivity being lost when the new process comes up. The pods remain reachable while the old process is terminating, but once the new process comes up, the existing pods have no connectivity.

Scenario:

  1. ovs in a separate container
  2. kubelet running on host
  3. network running inside of a hostPID/hostNetwork pod under a daemonset, directories mounted in, dockershim mounted in
  4. create a debug pod: oc run --restart=Never --image centos:7 debug -- /bin/bash -c '(sleep 10000)'
  5. create a serving pod: oc run --restart=Never --image gcr.io/google-containers/test-webserver imagetest
  6. exec from the debug pod to the serving pod: oc exec debug -- curl $(oc get pod imagetest -o jsonpath={.status.podIP}); the contents are visible
  7. delete sdn pod (OVS remains running)
  8. exec continues to work until the new SDN pod comes up, then exec starts to fail

It looks like the existing pod loses all networking: to the host for DNS, to the service network, etc.

User "sa" set.
Context "default-context" modified.
I1001 17:08:44.474838   70577 start_node.go:286] Reading node configuration from /etc/origin/node/node-config.yaml
I1001 17:08:44.476313   70577 feature_gate.go:144] feature gates: map[RotateKubeletClientCertificate:true RotateKubeletServerCertificate:true]
W1001 17:08:44.477530   70577 server.go:188] WARNING: all flags other than --config, --write-config-to, and --cleanup-iptables are deprecated. Please begin using a config file ASAP.
I1001 17:08:44.488988   70577 node.go:145] Initializing SDN node of type "redhat/openshift-ovs-multitenant" with configured hostname "10.1.2.2" (IP ""), iptables sync period "30s"
I1001 17:08:44.489318   70577 network_config.go:143] DNS Bind to 0.0.0.0:53
I1001 17:08:44.489345   70577 start_node.go:475] Starting node networking 10.1.2.2 (v3.7.0-alpha.1+44d050c-816-dirty)
I1001 17:08:44.489353   70577 node.go:299] Starting openshift-sdn network plugin
I1001 17:08:44.494930   70577 iptables.go:560] couldn't get iptables-restore version; assuming it doesn't support --wait
I1001 17:08:44.497494   70577 iptables.go:100] Syncing openshift iptables rules
I1001 17:08:44.497525   70577 iptables.go:392] running iptables -N [OPENSHIFT-FIREWALL-ALLOW -t filter]
I1001 17:08:44.500118   70577 iptables.go:392] running iptables -C [INPUT -t filter -m comment --comment firewall overrides -j OPENSHIFT-FIREWALL-ALLOW]
I1001 17:08:44.501557   70577 iptables.go:392] running iptables -C [OPENSHIFT-FIREWALL-ALLOW -t filter -p udp --dport 4789 -m comment --comment VXLAN incoming -j ACCEPT]
I1001 17:08:44.503926   70577 iptables.go:392] running iptables -C [OPENSHIFT-FIREWALL-ALLOW -t filter -i tun0 -m comment --comment from SDN to localhost -j ACCEPT]
I1001 17:08:44.505538   70577 iptables.go:392] running iptables -C [OPENSHIFT-FIREWALL-ALLOW -t filter -i docker0 -m comment --comment from docker to localhost -j ACCEPT]
I1001 17:08:44.506548   70577 iptables.go:392] running iptables -N [OPENSHIFT-ADMIN-OUTPUT-RULES -t filter]
I1001 17:08:44.507680   70577 iptables.go:392] running iptables -C [FORWARD -t filter -i tun0 ! -o tun0 -m comment --comment administrator overrides -j OPENSHIFT-ADMIN-OUTPUT-RULES]
I1001 17:08:44.509336   70577 iptables.go:392] running iptables -N [OPENSHIFT-MASQUERADE -t nat]
I1001 17:08:44.510442   70577 iptables.go:392] running iptables -C [POSTROUTING -t nat -m comment --comment rules for masquerading OpenShift traffic -j OPENSHIFT-MASQUERADE]
I1001 17:08:44.512246   70577 iptables.go:392] running iptables -C [OPENSHIFT-MASQUERADE -t nat -s 10.128.0.0/14 -m comment --comment masquerade pod-to-service and pod-to-external traffic -j MASQUERADE]
I1001 17:08:44.513624   70577 iptables.go:392] running iptables -N [OPENSHIFT-FIREWALL-FORWARD -t filter]
I1001 17:08:44.514651   70577 iptables.go:392] running iptables -C [FORWARD -t filter -m comment --comment firewall overrides -j OPENSHIFT-FIREWALL-FORWARD]
I1001 17:08:44.515743   70577 iptables.go:392] running iptables -C [OPENSHIFT-FIREWALL-FORWARD -t filter -s 10.128.0.0/14 -m comment --comment attempted resend after connection close -m conntrack --ctstate INVALID -j DROP]
I1001 17:08:44.516804   70577 iptables.go:392] running iptables -C [OPENSHIFT-FIREWALL-FORWARD -t filter -d 10.128.0.0/14 -m comment --comment forward traffic from SDN -j ACCEPT]
I1001 17:08:44.518988   70577 iptables.go:392] running iptables -C [OPENSHIFT-FIREWALL-FORWARD -t filter -s 10.128.0.0/14 -m comment --comment forward traffic to SDN -j ACCEPT]
I1001 17:08:44.520116   70577 iptables.go:98] syncIPTableRules took 22.619192ms
I1001 17:08:44.520248   70577 sdn_controller.go:157] [SDN setup] node pod subnet 10.128.0.0/23 gateway 10.128.0.1
I1001 17:08:44.520570   70577 sdn_controller.go:174] [SDN setup] full SDN setup required
I1001 17:08:44.520603   70577 ovs.go:139] Executing: ovs-vsctl --if-exists del-br br0 -- add-br br0 -- set Bridge br0 fail-mode=secure protocols=OpenFlow13
I1001 17:08:44.538636   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 set-frags br0 nx-match
I1001 17:08:44.555424   70577 ovs.go:139] Executing: ovs-vsctl --if-exists del-port br0 vxlan0
I1001 17:08:44.561902   70577 ovs.go:139] Executing: ovs-vsctl --may-exist add-port br0 vxlan0 -- set Interface vxlan0 ofport_request=1 type=vxlan options:remote_ip="flow" options:key="flow"
I1001 17:08:44.570005   70577 ovs.go:139] Executing: ovs-vsctl get Interface vxlan0 ofport
I1001 17:08:44.587814   70577 ovs.go:139] Executing: ovs-vsctl --if-exists del-port br0 tun0
I1001 17:08:44.593370   70577 ovs.go:139] Executing: ovs-vsctl --may-exist add-port br0 tun0 -- set Interface tun0 ofport_request=2 type=internal
I1001 17:08:44.603609   70577 ovs.go:139] Executing: ovs-vsctl get Interface tun0 ofport
I1001 17:08:44.608666   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=0, priority=200, in_port=1, arp, nw_src=10.128.0.0/14, nw_dst=10.128.0.0/23, actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:10
I1001 17:08:44.612712   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=0, priority=200, in_port=1, ip, nw_src=10.128.0.0/14, nw_dst=10.128.0.0/23, actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:10
I1001 17:08:44.615512   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=0, priority=200, in_port=1, ip, nw_src=10.128.0.0/14, nw_dst=224.0.0.0/4, actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:10
I1001 17:08:44.618495   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=0, priority=150, in_port=1, actions=drop
I1001 17:08:44.621384   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=0, priority=250, in_port=2, ip, nw_dst=224.0.0.0/4, actions=drop
I1001 17:08:44.624194   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=0, priority=200, in_port=2, arp, nw_src=10.128.0.1, nw_dst=10.128.0.0/14, actions=goto_table:30
I1001 17:08:44.627717   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=0, priority=200, in_port=2, ip, actions=goto_table:30
I1001 17:08:44.631047   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=0, priority=150, in_port=2, actions=drop
I1001 17:08:44.635160   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=0, priority=100, arp, actions=goto_table:20
I1001 17:08:44.637703   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=0, priority=100, ip, actions=goto_table:20
I1001 17:08:44.640217   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=0, priority=0, actions=drop
I1001 17:08:44.643242   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=10, priority=0, actions=drop
I1001 17:08:44.646764   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=20, priority=0, actions=drop
I1001 17:08:44.651092   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=21, priority=0, actions=goto_table:30
I1001 17:08:44.654956   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=30, priority=300, arp, nw_dst=10.128.0.1, actions=output:2
I1001 17:08:44.657401   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=30, priority=200, arp, nw_dst=10.128.0.0/23, actions=goto_table:40
I1001 17:08:44.689995   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=30, priority=100, arp, nw_dst=10.128.0.0/14, actions=goto_table:50
I1001 17:08:44.695070   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=30, priority=300, ip, nw_dst=10.128.0.1, actions=output:2
I1001 17:08:44.698663   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=30, priority=100, ip, nw_dst=172.30.0.0/16, actions=goto_table:60
I1001 17:08:44.706575   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=30, priority=200, ip, nw_dst=10.128.0.0/23, actions=goto_table:70
I1001 17:08:44.711554   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=30, priority=100, ip, nw_dst=10.128.0.0/14, actions=goto_table:90
I1001 17:08:44.715316   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=30, priority=50, in_port=1, ip, nw_dst=224.0.0.0/4, actions=goto_table:120
I1001 17:08:44.718247   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=30, priority=25, ip, nw_dst=224.0.0.0/4, actions=goto_table:110
I1001 17:08:44.720732   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=30, priority=0, ip, actions=goto_table:100
I1001 17:08:44.724321   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=30, priority=0, arp, actions=drop
I1001 17:08:44.727537   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=40, priority=0, actions=drop
I1001 17:08:44.730261   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=50, priority=0, actions=drop
I1001 17:08:44.735124   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=60, priority=200, reg0=0, actions=output:2
I1001 17:08:44.738425   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=60, priority=0, actions=drop
I1001 17:08:44.741716   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=70, priority=0, actions=drop
I1001 17:08:44.744070   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=80, priority=300, ip, nw_src=10.128.0.1/32, actions=output:NXM_NX_REG2[]
I1001 17:08:44.747215   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=80, priority=0, actions=drop
I1001 17:08:44.827038   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=90, priority=0, actions=drop
I1001 17:08:44.831202   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=100, priority=0, actions=goto_table:101
I1001 17:08:44.835278   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=101, priority=51,tcp,tcp_dst=53,nw_dst=10.1.2.2,actions=output:2
I1001 17:08:44.838914   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=101, priority=51,udp,udp_dst=53,nw_dst=10.1.2.2,actions=output:2
I1001 17:08:44.841359   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=101, priority=0, actions=output:2
I1001 17:08:44.844543   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=110, priority=0, actions=drop
I1001 17:08:44.847219   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=111, priority=100, actions=goto_table:120
I1001 17:08:44.851924   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=120, priority=0, actions=drop
I1001 17:08:44.854680   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=253, actions=note:01.05
I1001 17:08:44.858730   70577 reflector.go:202] Starting reflector *network.NetNamespace (30m0s) from github.com/openshift/origin/pkg/network/common/common.go:190
I1001 17:08:44.858788   70577 reflector.go:251] Listing and watching *network.NetNamespace from github.com/openshift/origin/pkg/network/common/common.go:190
I1001 17:08:44.859307   70577 reflector.go:202] Starting reflector *network.HostSubnet (30m0s) from github.com/openshift/origin/pkg/network/common/common.go:190
I1001 17:08:44.859370   70577 reflector.go:202] Starting reflector *network.HostSubnet (30m0s) from github.com/openshift/origin/pkg/network/common/common.go:190
I1001 17:08:44.859370   70577 reflector.go:251] Listing and watching *network.HostSubnet from github.com/openshift/origin/pkg/network/common/common.go:190
I1001 17:08:44.859395   70577 reflector.go:251] Listing and watching *network.HostSubnet from github.com/openshift/origin/pkg/network/common/common.go:190
I1001 17:08:44.867912   70577 vnids.go:134] Associate netid 0 to namespace "default" with mcEnabled false
I1001 17:08:44.867932   70577 vnids.go:134] Associate netid 3147011 to namespace "kube-public" with mcEnabled false
I1001 17:08:44.867939   70577 vnids.go:134] Associate netid 14891178 to namespace "kube-system" with mcEnabled false
I1001 17:08:44.867945   70577 vnids.go:134] Associate netid 7390507 to namespace "openshift" with mcEnabled false
I1001 17:08:44.867950   70577 vnids.go:134] Associate netid 16724382 to namespace "openshift-infra" with mcEnabled false
I1001 17:08:44.867956   70577 vnids.go:134] Associate netid 6923887 to namespace "openshift-node" with mcEnabled false
I1001 17:08:44.867961   70577 vnids.go:134] Associate netid 7551784 to namespace "test" with mcEnabled false
I1001 17:08:44.867967   70577 vnids.go:134] Associate netid 9564310 to namespace "test2" with mcEnabled false
I1001 17:08:44.867978   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=80, priority=200, reg0=0, actions=output:NXM_NX_REG2[]
I1001 17:08:44.871083   70577 reflector.go:202] Starting reflector *network.NetNamespace (30m0s) from github.com/openshift/origin/pkg/network/common/common.go:190
I1001 17:08:44.871838   70577 reflector.go:251] Listing and watching *network.NetNamespace from github.com/openshift/origin/pkg/network/common/common.go:190
I1001 17:08:44.875631   70577 vnids.go:184] Watch Sync event for NetNamespace "default"
I1001 17:08:44.875689   70577 vnids.go:184] Watch Sync event for NetNamespace "kube-public"
I1001 17:08:44.875697   70577 vnids.go:184] Watch Sync event for NetNamespace "kube-system"
I1001 17:08:44.875703   70577 vnids.go:184] Watch Sync event for NetNamespace "openshift"
I1001 17:08:44.875718   70577 vnids.go:184] Watch Sync event for NetNamespace "openshift-infra"
I1001 17:08:44.875724   70577 vnids.go:184] Watch Sync event for NetNamespace "openshift-node"
I1001 17:08:44.875730   70577 vnids.go:184] Watch Sync event for NetNamespace "test"
I1001 17:08:44.875736   70577 vnids.go:184] Watch Sync event for NetNamespace "test2"
I1001 17:08:44.891184   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=80, priority=200, reg1=0, actions=output:NXM_NX_REG2[]
I1001 17:08:44.896097   70577 node.go:350] Starting openshift-sdn pod manager
I1001 17:08:44.896436   70577 reflector.go:202] Starting reflector *network.EgressNetworkPolicy (30m0s) from github.com/openshift/origin/pkg/network/common/common.go:190
I1001 17:08:44.896494   70577 reflector.go:251] Listing and watching *network.EgressNetworkPolicy from github.com/openshift/origin/pkg/network/common/common.go:190
I1001 17:08:44.901502   70577 iptables.go:560] couldn't get iptables-restore version; assuming it doesn't support --wait
E1001 17:08:44.902653   70577 cniserver.go:130] failed to remove old pod info socket: remove /var/run/openshift-sdn: device or resource busy
I1001 17:08:44.905707   70577 remote_runtime.go:42] Connecting to runtime service /var/run/kubernetes/dockershim.sock
W1001 17:08:44.905751   70577 util_linux.go:75] Using "/var/run/kubernetes/dockershim.sock" as endpoint is deprecated, please consider using full url format "unix:///var/run/kubernetes/dockershim.sock".
I1001 17:08:44.908042   70577 pod.go:212] Dispatching pod network request &{UPDATE openshift-node debug a12ab5ceb5373f045f169360a58239526363e960f8aff56db7f40175a70a2f15  0xc420a20ba0}
I1001 17:08:44.908107   70577 pod.go:248] Processing pod network request &{UPDATE openshift-node debug a12ab5ceb5373f045f169360a58239526363e960f8aff56db7f40175a70a2f15  0xc420a20ba0}
I1001 17:08:44.908128   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 dump-flows br0
I1001 17:08:44.911380   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 del-flows br0 ip, nw_dst=10.128.0.89
I1001 17:08:44.914808   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 del-flows br0 ip, nw_src=10.128.0.89
I1001 17:08:44.917524   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 del-flows br0 arp, nw_dst=10.128.0.89
I1001 17:08:44.921207   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 del-flows br0 arp, nw_src=10.128.0.89
I1001 17:08:44.924725   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=20, priority=100, in_port=6, arp, nw_src=10.128.0.89, arp_sha=0a:58:0a:80:00:59, actions=load:6923887->NXM_NX_REG0[], note:a1.2a.b5.ce.b5.37.3f.04.5f.16.93.60.a5.82.39.52.63.63.e9.60.f8.af.f5.6d.b7.f4.01.75.a7.0a.2f.15, goto_table:21
I1001 17:08:44.928280   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=20, priority=100, in_port=6, ip, nw_src=10.128.0.89, actions=load:6923887->NXM_NX_REG0[], goto_table:21
I1001 17:08:44.932192   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=40, priority=100, arp, nw_dst=10.128.0.89, actions=output:6
I1001 17:08:44.934788   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=70, priority=100, ip, nw_dst=10.128.0.89, actions=load:6923887->NXM_NX_REG1[], load:6->NXM_NX_REG2[], goto_table:80
I1001 17:08:44.937572   70577 pod.go:250] Processed pod network request &{UPDATE openshift-node debug a12ab5ceb5373f045f169360a58239526363e960f8aff56db7f40175a70a2f15  0xc420a20ba0}, result  err <nil>
I1001 17:08:44.937618   70577 pod.go:215] Returning pod network request &{UPDATE openshift-node debug a12ab5ceb5373f045f169360a58239526363e960f8aff56db7f40175a70a2f15  0xc420a20ba0}, result  err <nil>
I1001 17:08:44.937635   70577 multitenant.go:138] EnsureVNIDRules 6923887 - adding rules
I1001 17:08:44.937646   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=80, priority=100, reg0=6923887, reg1=6923887, actions=output:NXM_NX_REG2[]
I1001 17:08:44.942978   70577 pod.go:212] Dispatching pod network request &{UPDATE openshift-node imagetest3 371dc5645a6dc875a04469cf45606f71bca3b662b5da5459c4d369d8326f030d  0xc420a212c0}
I1001 17:08:44.943009   70577 pod.go:248] Processing pod network request &{UPDATE openshift-node imagetest3 371dc5645a6dc875a04469cf45606f71bca3b662b5da5459c4d369d8326f030d  0xc420a212c0}
I1001 17:08:44.943025   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 dump-flows br0
I1001 17:08:44.948176   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 del-flows br0 ip, nw_dst=10.128.0.88
I1001 17:08:44.951593   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 del-flows br0 ip, nw_src=10.128.0.88
I1001 17:08:44.956378   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 del-flows br0 arp, nw_dst=10.128.0.88
I1001 17:08:44.960566   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 del-flows br0 arp, nw_src=10.128.0.88
I1001 17:08:44.987806   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=20, priority=100, in_port=5, arp, nw_src=10.128.0.88, arp_sha=0a:58:0a:80:00:58, actions=load:6923887->NXM_NX_REG0[], note:37.1d.c5.64.5a.6d.c8.75.a0.44.69.cf.45.60.6f.71.bc.a3.b6.62.b5.da.54.59.c4.d3.69.d8.32.6f.03.0d, goto_table:21
I1001 17:08:44.991724   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=20, priority=100, in_port=5, ip, nw_src=10.128.0.88, actions=load:6923887->NXM_NX_REG0[], goto_table:21
I1001 17:08:44.994538   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=40, priority=100, arp, nw_dst=10.128.0.88, actions=output:5
I1001 17:08:44.998331   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=70, priority=100, ip, nw_dst=10.128.0.88, actions=load:6923887->NXM_NX_REG1[], load:5->NXM_NX_REG2[], goto_table:80
I1001 17:08:45.002527   70577 pod.go:250] Processed pod network request &{UPDATE openshift-node imagetest3 371dc5645a6dc875a04469cf45606f71bca3b662b5da5459c4d369d8326f030d  0xc420a212c0}, result  err <nil>
I1001 17:08:45.002667   70577 pod.go:215] Returning pod network request &{UPDATE openshift-node imagetest3 371dc5645a6dc875a04469cf45606f71bca3b662b5da5459c4d369d8326f030d  0xc420a212c0}, result  err <nil>
I1001 17:08:45.002754   70577 node.go:393] openshift-sdn network plugin ready
I1001 17:08:45.002835   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 dump-flows br0
I1001 17:08:45.003508   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 dump-flows br0
I1001 17:08:45.006267   70577 iptables.go:560] couldn't get iptables-restore version; assuming it doesn't support --wait
I1001 17:08:45.009477   70577 network.go:87] Using iptables Proxier.
E1001 17:08:45.010305   70577 metrics.go:172] failed to parse max ARP entries "32768\n" for metrics: *strconv.NumError strconv.Atoi: parsing "32768\n": invalid syntax
I1001 17:08:45.010806   70577 multitenant.go:152] SyncVNIDRules: 0 unused VNIDs
W1001 17:08:45.012338   70577 proxier.go:488] clusterCIDR not specified, unable to distinguish between internal and external traffic
I1001 17:08:45.012373   70577 proxier.go:518] minSyncPeriod: 0s, syncPeriod: 30s, burstSyncs: 2
I1001 17:08:45.012529   70577 network.go:118] Tearing down userspace rules.
... iptables spam
I1001 17:08:45.072344   70577 proxy.go:81] Starting multitenant SDN proxy endpoint filter
I1001 17:08:45.072515   70577 config.go:202] Starting service config controller
I1001 17:08:45.072555   70577 controller_utils.go:1025] Waiting for caches to sync for service config controller
I1001 17:08:45.075968   70577 network.go:225] Started Kubernetes Proxy on 0.0.0.0
I1001 17:08:45.076022   70577 healthcheck.go:306] Starting goroutine for healthz on 0.0.0.0:10256
I1001 17:08:45.076326   70577 reflector.go:202] Starting reflector *network.EgressNetworkPolicy (30m0s) from github.com/openshift/origin/pkg/network/common/common.go:190
I1001 17:08:45.076351   70577 reflector.go:251] Listing and watching *network.EgressNetworkPolicy from github.com/openshift/origin/pkg/network/common/common.go:190
I1001 17:08:45.076709   70577 config.go:102] Starting endpoints config controller
I1001 17:08:45.076723   70577 controller_utils.go:1025] Waiting for caches to sync for endpoints config controller
I1001 17:08:45.076328   70577 network.go:51] Starting DNS on 0.0.0.0:53
I1001 17:08:45.076749   70577 reflector.go:213] Starting reflector *api.Endpoints (0s) from github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:72
I1001 17:08:45.076773   70577 reflector.go:251] Listing and watching *api.Endpoints from github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:72
I1001 17:08:45.077016   70577 logs.go:41] skydns: ready for queries on cluster.local. for tcp://0.0.0.0:53 [rcache 0]
I1001 17:08:45.077030   70577 logs.go:41] skydns: ready for queries on cluster.local. for udp://0.0.0.0:53 [rcache 0]
I1001 17:08:45.077209   70577 bounded_frequency_runner.go:170] sync-runner Loop running
I1001 17:08:45.077287   70577 reflector.go:202] Starting reflector *network.NetNamespace (30m0s) from github.com/openshift/origin/pkg/network/common/common.go:190
I1001 17:08:45.077305   70577 reflector.go:251] Listing and watching *network.NetNamespace from github.com/openshift/origin/pkg/network/common/common.go:190
I1001 17:08:45.077423   70577 reflector.go:213] Starting reflector *api.Service (0s) from github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:72
I1001 17:08:45.077445   70577 reflector.go:251] Listing and watching *api.Service from github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/client/informers/informers_generated/internalversion/factory.go:72
I1001 17:08:45.079354   70577 config.go:224] Calling handler.OnServiceAdd
I1001 17:08:45.079404   70577 proxy.go:80] hybrid proxy: add svc kubernetes in main proxy
I1001 17:08:45.079449   70577 proxy.go:135] Watch Sync event for NetNamespace "openshift"
I1001 17:08:45.079477   70577 proxy.go:135] Watch Sync event for NetNamespace "openshift-infra"
I1001 17:08:45.079470   70577 node.go:476] Watch ADDED event for Service "kubernetes"
I1001 17:08:45.079484   70577 proxy.go:135] Watch Sync event for NetNamespace "openshift-node"
I1001 17:08:45.079493   70577 proxy.go:135] Watch Sync event for NetNamespace "test"
I1001 17:08:45.079500   70577 proxy.go:135] Watch Sync event for NetNamespace "test2"
I1001 17:08:45.079510   70577 proxy.go:135] Watch Sync event for NetNamespace "default"
I1001 17:08:45.079519   70577 proxy.go:135] Watch Sync event for NetNamespace "kube-public"
I1001 17:08:45.079550   70577 proxy.go:135] Watch Sync event for NetNamespace "kube-system"
I1001 17:08:45.079516   70577 sdn_controller.go:255] AddServiceRules for &{{ } {kubernetes  default /api/v1/namespaces/default/services/kubernetes f43f2878-a618-11e7-ac65-7831c1b76042 56 0 2017-09-30 19:52:56 +0000 UTC <nil> <nil> map[component:apiserver provider:kubernetes] map[] [] nil [] } {ClusterIP [{https TCP 443 {0 8443 } 0} {dns UDP 53 {0 8053 } 0} {dns-tcp TCP 53 {0 8053 } 0}] map[] 172.30.0.1  []  ClientIP []  0} {{[]}}}
I1001 17:08:45.079946   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=60, ip, nw_dst=172.30.0.1, ip_frag=later, priority=100, actions=load:0->NXM_NX_REG1[], load:2->NXM_NX_REG2[], goto_table:80
I1001 17:08:45.080512   70577 config.go:124] Calling handler.OnEndpointsAdd
I1001 17:08:45.080601   70577 proxy.go:170] hybrid proxy: (always) add ep kubernetes in unidling proxy
I1001 17:08:45.080645   70577 roundrobin.go:276] LoadBalancerRR: Setting endpoints for default/kubernetes:https to [192.168.1.106:8443]
I1001 17:08:45.080746   70577 roundrobin.go:100] LoadBalancerRR service "default/kubernetes:https" did not exist, created
I1001 17:08:45.080760   70577 roundrobin.go:276] LoadBalancerRR: Setting endpoints for default/kubernetes:dns-tcp to [192.168.1.106:8053]
I1001 17:08:45.080768   70577 roundrobin.go:100] LoadBalancerRR service "default/kubernetes:dns-tcp" did not exist, created
I1001 17:08:45.080776   70577 roundrobin.go:276] LoadBalancerRR: Setting endpoints for default/kubernetes:dns to [192.168.1.106:8053]
I1001 17:08:45.080792   70577 roundrobin.go:100] LoadBalancerRR service "default/kubernetes:dns" did not exist, created
I1001 17:08:45.080800   70577 proxy.go:185] hybrid proxy: add ep kubernetes in main proxy
I1001 17:08:45.080823   70577 proxier.go:882] Setting endpoints for "default/kubernetes:https" to [192.168.1.106:8443]
I1001 17:08:45.080834   70577 proxier.go:882] Setting endpoints for "default/kubernetes:dns-tcp" to [192.168.1.106:8053]
I1001 17:08:45.081205   70577 proxier.go:882] Setting endpoints for "default/kubernetes:dns" to [192.168.1.106:8053]
I1001 17:08:45.085136   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=60, ip, nw_dst=172.30.0.1, tcp, tcp_dst=443, priority=100, actions=load:0->NXM_NX_REG1[], load:2->NXM_NX_REG2[], goto_table:80
I1001 17:08:45.090320   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=60, ip, nw_dst=172.30.0.1, udp, udp_dst=53, priority=100, actions=load:0->NXM_NX_REG1[], load:2->NXM_NX_REG2[], goto_table:80
I1001 17:08:45.094232   70577 ovs.go:139] Executing: ovs-ofctl -O OpenFlow13 add-flow br0 table=60, ip, nw_dst=172.30.0.1, tcp, tcp_dst=53, priority=100, actions=load:0->NXM_NX_REG1[], load:2->NXM_NX_REG2[], goto_table:80
I1001 17:08:45.173872   70577 shared_informer.go:116] caches populated
I1001 17:08:45.173969   70577 controller_utils.go:1032] Caches are synced for service config controller
I1001 17:08:45.173980   70577 config.go:210] Calling handler.OnServiceSynced()
I1001 17:08:45.174069   70577 proxier.go:997] Not syncing iptables until Services and Endpoints have been received from master
I1001 17:08:45.174087   70577 proxier.go:993] syncProxyRules took 33.576µs
I1001 17:08:45.174138   70577 proxy.go:126] hybrid proxy: services synced
I1001 17:08:45.177078   70577 shared_informer.go:116] caches populated
I1001 17:08:45.177192   70577 controller_utils.go:1032] Caches are synced for endpoints config controller
I1001 17:08:45.177202   70577 config.go:110] Calling handler.OnEndpointsSynced()
I1001 17:08:45.201579   70577 iptables.go:369] running iptables-restore [--noflush --counters]
I1001 17:08:45.205336   70577 conntrack.go:36] Deleting connection tracking state for service IP 172.30.0.1
I1001 17:08:45.208441   70577 proxier.go:993] syncProxyRules took 31.195608ms
I1001 17:08:45.208470   70577 proxy.go:269] hybrid proxy: endpoints synced

Dump from within ovs pod:

$ ovs-ofctl -O OpenFlow13 dump-flows br0
OFPST_FLOW reply (OF1.3) (xid=0x2):
 cookie=0x0, duration=478.463s, table=0, n_packets=0, n_bytes=0, priority=250,ip,in_port=2,nw_dst=224.0.0.0/4 actions=drop
 cookie=0x0, duration=478.474s, table=0, n_packets=0, n_bytes=0, priority=200,arp,in_port=1,arp_spa=10.128.0.0/14,arp_tpa=10.128.0.0/23 actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:10
 cookie=0x0, duration=478.471s, table=0, n_packets=0, n_bytes=0, priority=200,ip,in_port=1,nw_src=10.128.0.0/14,nw_dst=10.128.0.0/23 actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:10
 cookie=0x0, duration=478.468s, table=0, n_packets=0, n_bytes=0, priority=200,ip,in_port=1,nw_src=10.128.0.0/14,nw_dst=224.0.0.0/4 actions=move:NXM_NX_TUN_ID[0..31]->NXM_NX_REG0[],goto_table:10
 cookie=0x0, duration=478.459s, table=0, n_packets=0, n_bytes=0, priority=200,arp,in_port=2,arp_spa=10.128.0.1,arp_tpa=10.128.0.0/14 actions=goto_table:30
 cookie=0x0, duration=478.456s, table=0, n_packets=0, n_bytes=0, priority=200,ip,in_port=2 actions=goto_table:30
 cookie=0x0, duration=478.465s, table=0, n_packets=0, n_bytes=0, priority=150,in_port=1 actions=drop
 cookie=0x0, duration=478.452s, table=0, n_packets=32, n_bytes=2592, priority=150,in_port=2 actions=drop
 cookie=0x0, duration=478.449s, table=0, n_packets=8, n_bytes=336, priority=100,arp actions=goto_table:20
 cookie=0x0, duration=478.446s, table=0, n_packets=130, n_bytes=18382, priority=100,ip actions=goto_table:20
 cookie=0x0, duration=478.444s, table=0, n_packets=28, n_bytes=2232, priority=0 actions=drop
 cookie=0x0, duration=478.440s, table=10, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x0, duration=478.158s, table=20, n_packets=0, n_bytes=0, priority=100,arp,in_port=6,arp_spa=10.128.0.89,arp_sha=0a:58:0a:80:00:59 actions=load:0x69a66f->NXM_NX_REG0[],note:a1.2a.b5.ce.b5.37.3f.04.5f.16.93.60.a5.82.39.52.63.63.e9.60.f8.af.f5.6d.b7.f4.01.75.a7.0a.2f.15.00.00.00.00.00.00,goto_table:21
 cookie=0x0, duration=478.095s, table=20, n_packets=0, n_bytes=0, priority=100,arp,in_port=5,arp_spa=10.128.0.88,arp_sha=0a:58:0a:80:00:58 actions=load:0x69a66f->NXM_NX_REG0[],note:37.1d.c5.64.5a.6d.c8.75.a0.44.69.cf.45.60.6f.71.bc.a3.b6.62.b5.da.54.59.c4.d3.69.d8.32.6f.03.0d.00.00.00.00.00.00,goto_table:21
 cookie=0x0, duration=478.155s, table=20, n_packets=0, n_bytes=0, priority=100,ip,in_port=6,nw_src=10.128.0.89 actions=load:0x69a66f->NXM_NX_REG0[],goto_table:21
 cookie=0x0, duration=478.092s, table=20, n_packets=0, n_bytes=0, priority=100,ip,in_port=5,nw_src=10.128.0.88 actions=load:0x69a66f->NXM_NX_REG0[],goto_table:21
 cookie=0x0, duration=478.436s, table=20, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x0, duration=478.432s, table=21, n_packets=138, n_bytes=18718, priority=0 actions=goto_table:30
 cookie=0x0, duration=478.429s, table=30, n_packets=0, n_bytes=0, priority=300,arp,arp_tpa=10.128.0.1 actions=output:2
 cookie=0x0, duration=478.388s, table=30, n_packets=0, n_bytes=0, priority=300,ip,nw_dst=10.128.0.1 actions=output:2
 cookie=0x0, duration=478.398s, table=30, n_packets=8, n_bytes=336, priority=200,arp,arp_tpa=10.128.0.0/23 actions=goto_table:40
 cookie=0x0, duration=478.375s, table=30, n_packets=130, n_bytes=18382, priority=200,ip,nw_dst=10.128.0.0/23 actions=goto_table:70
 cookie=0x0, duration=478.392s, table=30, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.128.0.0/14 actions=goto_table:50
 cookie=0x0, duration=478.372s, table=30, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.128.0.0/14 actions=goto_table:90
 cookie=0x0, duration=478.380s, table=30, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=172.30.0.0/16 actions=goto_table:60
 cookie=0x0, duration=478.368s, table=30, n_packets=0, n_bytes=0, priority=50,ip,in_port=1,nw_dst=224.0.0.0/4 actions=goto_table:120
 cookie=0x0, duration=478.366s, table=30, n_packets=0, n_bytes=0, priority=25,ip,nw_dst=224.0.0.0/4 actions=goto_table:110
 cookie=0x0, duration=478.363s, table=30, n_packets=0, n_bytes=0, priority=0,ip actions=goto_table:100
 cookie=0x0, duration=478.359s, table=30, n_packets=0, n_bytes=0, priority=0,arp actions=drop
 cookie=0x0, duration=478.152s, table=40, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.128.0.89 actions=output:6
 cookie=0x0, duration=478.089s, table=40, n_packets=0, n_bytes=0, priority=100,arp,arp_tpa=10.128.0.88 actions=output:5
 cookie=0x0, duration=478.357s, table=40, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x0, duration=478.354s, table=50, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x0, duration=478.349s, table=60, n_packets=0, n_bytes=0, priority=200,reg0=0 actions=output:2
 cookie=0x0, duration=478.002s, table=60, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=172.30.0.1,nw_frag=later actions=load:0->NXM_NX_REG1[],load:0x2->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=477.997s, table=60, n_packets=0, n_bytes=0, priority=100,tcp,nw_dst=172.30.0.1,tp_dst=443 actions=load:0->NXM_NX_REG1[],load:0x2->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=477.993s, table=60, n_packets=0, n_bytes=0, priority=100,udp,nw_dst=172.30.0.1,tp_dst=53 actions=load:0->NXM_NX_REG1[],load:0x2->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=477.989s, table=60, n_packets=0, n_bytes=0, priority=100,tcp,nw_dst=172.30.0.1,tp_dst=53 actions=load:0->NXM_NX_REG1[],load:0x2->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=478.345s, table=60, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x0, duration=478.149s, table=70, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.128.0.89 actions=load:0x69a66f->NXM_NX_REG1[],load:0x6->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=478.085s, table=70, n_packets=0, n_bytes=0, priority=100,ip,nw_dst=10.128.0.88 actions=load:0x69a66f->NXM_NX_REG1[],load:0x5->NXM_NX_REG2[],goto_table:80
 cookie=0x0, duration=478.343s, table=70, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x0, duration=478.340s, table=80, n_packets=0, n_bytes=0, priority=300,ip,nw_src=10.128.0.1 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=478.196s, table=80, n_packets=0, n_bytes=0, priority=200,reg0=0 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=478.192s, table=80, n_packets=0, n_bytes=0, priority=200,reg1=0 actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=478.145s, table=80, n_packets=130, n_bytes=18382, priority=100,reg0=0x69a66f,reg1=0x69a66f actions=output:NXM_NX_REG2[]
 cookie=0x0, duration=478.260s, table=80, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x0, duration=478.256s, table=90, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x0, duration=478.252s, table=100, n_packets=0, n_bytes=0, priority=0 actions=goto_table:101
 cookie=0x0, duration=478.248s, table=101, n_packets=0, n_bytes=0, priority=51,tcp,nw_dst=10.1.2.2,tp_dst=53 actions=output:2
 cookie=0x0, duration=478.246s, table=101, n_packets=0, n_bytes=0, priority=51,udp,nw_dst=10.1.2.2,tp_dst=53 actions=output:2
 cookie=0x0, duration=478.243s, table=101, n_packets=0, n_bytes=0, priority=0 actions=output:2
 cookie=0x0, duration=478.240s, table=110, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x0, duration=478.236s, table=111, n_packets=0, n_bytes=0, priority=100 actions=goto_table:120
 cookie=0x0, duration=478.232s, table=120, n_packets=0, n_bytes=0, priority=0 actions=drop
 cookie=0x0, duration=478.229s, table=253, n_packets=0, n_bytes=0, actions=note:01.05.00.00.00.00

smarterclayton commented Oct 1, 2017

Questions:

  1. should openshift-sdn be reentrant when OVS and the kubelet haven't restarted? (I expect so)
  2. is this because the bridge is being reset?
  3. am I missing something in the configuration of the pod that would prevent reestablishment (link to definition 4c6411a#diff-252005c02b4e43123ecf7e3f2dac00ba)

Other follow up:

  1. is hostPID necessary, or can we use an indirection to look up the process for the network namespace when we assign pods? (I wouldn't have expected to need it)

@smarterclayton

@openshift/sig-networking

smarterclayton added the kind/bug label on Oct 1, 2017

smarterclayton commented Oct 2, 2017

Looks like when running the full node (not containerized), it also doesn't get reconnected on restart. What am I missing?

@danwinship

I1001 17:08:44.520570   70577 sdn_controller.go:174] [SDN setup] full SDN setup required

It shouldn't be saying that if nothing (config or code) changed. It should just pick up the existing setup. So that's one problem. (The code to check if things are already set up must not be working in this environment?)

It looks like it's recreating it correctly though. In particular, it runs through a Processing pod network request &{UPDATE ... for each pod and creates correct-looking OVS flows for them...


smarterclayton commented Oct 2, 2017 via email

@smarterclayton

Added debugging:

I1002 19:34:42.676691   94742 sdn_controller.go:163] [SDN setup] node pod subnet 10.128.0.0/23 gateway 10.128.0.1
I1002 19:34:42.677220   94742 sdn_controller.go:105] did not find route in cluster cidrs: {Ifindex: 554 Dst: 172.30.0.0/16 Src: <nil> Gw: <nil> Flags: [] Table: 254} [10.128.0.0/14]
I1002 19:34:42.677275   94742 sdn_controller.go:180] [SDN setup] full SDN setup required

From this line in alreadySetUp():

	for _, route := range routes {
		found = false
		for _, clusterCIDR := range clusterNetworkCIDR {
			if route.Dst != nil && route.Dst.String() == clusterCIDR {
				found = true
				break
			}
		}
		if !found {
			glog.V(3).Infof("did not find route in cluster cidrs: %s %v", route, clusterNetworkCIDR)
			return false
		}
	}
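
The loop above requires every route on tun0 to match some cluster CIDR, so the stray 172.30.0.0/16 service-network route in the log makes alreadySetUp() report failure. The fix merged below inverts the check: every cluster CIDR must have a matching route, and extra routes are ignored. A hedged sketch of the inverted loop, reusing the names from the excerpt above rather than the literal merged code:

	for _, clusterCIDR := range clusterNetworkCIDR {
		found = false
		for _, route := range routes {
			// Only the cluster CIDRs need a matching route; other routes
			// (e.g. the service network) are irrelevant to this check.
			if route.Dst != nil && route.Dst.String() == clusterCIDR {
				found = true
				break
			}
		}
		if !found {
			glog.V(3).Infof("did not find route for cluster cidr: %s", clusterCIDR)
			return false
		}
	}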

@smarterclayton

Master config for networking:

networkConfig:
  clusterNetworks:
  - cidr: 10.128.0.0/14
    hostSubnetLength: 9
  externalIPNetworkCIDRs: null
  ingressIPNetworkCIDR: 172.29.0.0/16
  networkPluginName: redhat/openshift-ovs-multitenant
  serviceNetworkCIDR: 172.30.0.0/16

and kube

  servicesSubnet: 172.30.0.0/16


smarterclayton commented Oct 2, 2017

Thanks for the fix.

One question as well - should we have a babysitter that can detect when we are missing large numbers of flows and re-trigger setup? Crashloop, perhaps? I.e. if I shoot OVS, how long before the SDN controller detects that and fixes it?

Having a periodic check that refreshes within a window would be ideal - the faster we can detect a failure from OVS, the better. Do we (or can we) heartbeat OVS from the SDN controller and detect these sorts of disruptions?

openshift-merge-robot added a commit that referenced this issue Oct 3, 2017
Automatic merge from submit-queue.

Fix route checking in alreadySetUp

We want to check that each cluster network has a corresponding route, not that each route has a corresponding cluster network.

Fixes #16630
smarterclayton reopened this Oct 3, 2017
@smarterclayton

Holding this open to close out whether we need to do more (if we're going to be running in a pod setup) to be resilient to OVS restarts.

@danwinship

One question as well - should we have a babysitter that can detect when we are missing large numbers of flows and re-trigger setup? Crashloop, perhaps? I.e. if I shoot OVS, how long before the SDN controller detects that and fixes it?

In OCP, we set up the systemd unit files so that if systemd restarts OVS (e.g. due to a crash or an upgrade), it will restart OpenShift too, so we recover. But if you just "ip link del br0", things will stay broken until you restart OpenShift yourself. But you know, "don't do that then"?
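
One way that coupling can be expressed with systemd (these are standard systemd directives, but the actual OCP unit files may differ) is to make the node service a dependent of the OVS service:

```
# illustrative drop-in for the node unit, not the shipped configuration
[Unit]
Requires=openvswitch.service
After=openvswitch.service
PartOf=openvswitch.service
```

With PartOf=, stopping or restarting openvswitch.service propagates to the node service, so the SDN setup is re-run whenever OVS is restarted through systemd.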


smarterclayton commented Oct 7, 2017 via email

@smarterclayton

Opened a fix that health-checks OVS and exits the process if OVS is detected as having been reset.

I also want to look at what events we should send here.

@smarterclayton

Added an event for when the pod is restarted.

openshift-merge-robot added a commit that referenced this issue Oct 10, 2017
Automatic merge from submit-queue (batch tested with PRs 16737, 16638, 16742, 16765, 16711).

Health check the OVS process and restart if it dies

Reorganize the existing setup code to perform a periodic background check on the state of the OVS database. If the SDN setup is lost, force the node/network processes to restart. Use the JSONRPC endpoint to perform a few simple checks of status, and detect failure quickly. This reuses our existing health check code, which does not appear to be a performance issue when checked periodically.

Node waiting for OVS to start:

```
I1008 06:41:25.661293   11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:26.690356   11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:27.653112   11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:28.671950   11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:29.653713   11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
W1008 06:41:30.285617   11598 cni.go:189] Unable to update cni config: No networks found in /etc/cni/net.d
E1008 06:41:30.286780   11598 kubelet.go:2093] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized
I1008 06:41:30.661441   11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:31.653232   11598 healthcheck.go:27] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:32.674697   11598 sdn_controller.go:180] [SDN setup] full SDN setup required
```

Let node start, then stop OVS, node detects immediately

```
I1008 06:41:40.208239   11598 kubelet_node_status.go:433] Recording NodeReady event message for node localhost.localdomain
I1008 06:41:43.076299   11598 nodecontroller.go:770] NodeController detected that some Nodes are Ready. Exiting master disruption mode.
E1008 06:41:50.941351   11598 healthcheck.go:55] SDN healthcheck disconnected from OVS server: <nil>
I1008 06:41:50.941541   11598 healthcheck.go:60] SDN healthcheck unable to reconnect to OVS server: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
I1008 06:41:51.045661   11598 healthcheck.go:60] SDN healthcheck unable to reconnect to OVS server: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
F1008 06:41:51.148105   11598 healthcheck.go:76] SDN healthcheck detected unhealthy OVS server, restarting: OVS health check failed
```

Fixes #16630

@openshift/sig-networking
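
The merged change, per the description above, dials the OVS database socket periodically and exits the process when the socket stays unreachable, letting the supervisor (systemd or the daemonset) restart it and re-run the full SDN setup. A rough sketch of that pattern (the socket path comes from the logs above; the helper names, intervals, and wiring are assumptions rather than the actual origin code):

```go
package main

import (
	"log"
	"net"
	"time"
)

const ovsdbSocket = "/var/run/openvswitch/db.sock"

// dialOVS returns nil if the ovsdb-server unix socket accepts a connection.
func dialOVS() error {
	conn, err := net.DialTimeout("unix", ovsdbSocket, 2*time.Second)
	if err != nil {
		return err
	}
	return conn.Close()
}

// runOVSHealthCheck polls the socket; if it stays unreachable past the grace
// period, the process exits so its supervisor restarts it and redoes SDN setup.
func runOVSHealthCheck(interval, grace time.Duration) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for range ticker.C {
		if err := dialOVS(); err == nil {
			continue
		}
		deadline := time.Now().Add(grace)
		healthy := false
		for time.Now().Before(deadline) {
			if err := dialOVS(); err == nil {
				healthy = true
				break
			}
			log.Printf("SDN healthcheck unable to reconnect to OVS server")
			time.Sleep(100 * time.Millisecond)
		}
		if !healthy {
			log.Fatalf("SDN healthcheck detected unhealthy OVS server, restarting")
		}
	}
}

func main() {
	runOVSHealthCheck(10*time.Second, time.Second)
}
```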