
openshift-sdn network restart terminates run once pods immediately #16632

Closed

smarterclayton opened this issue Oct 1, 2017 · 6 comments

Labels: component/networking · kind/bug · lifecycle/rotten · priority/P1 · sig/networking

@smarterclayton
Contributor

smarterclayton commented Oct 1, 2017

When openshift-sdn restarts, the CRI net namespace restart function terminates run-once pods that may not even need networking, leading to failures.

It's not clear to me that completely terminating all run-once pods on a node when the sdn process is disrupted is correct.

I1001 22:40:49.619765  123473 pod.go:250] Processed pod network request &{UPDATE openshift-node imagetest acda4ba2cdc58950364307639a38e0724a2b57bd519a0a576fe6f766d1617467  0xc42097d680}, result  err failed to find pod details from OVS flows
I1001 22:40:49.619819  123473 pod.go:215] Returning pod network request &{UPDATE openshift-node imagetest acda4ba2cdc58950364307639a38e0724a2b57bd519a0a576fe6f766d1617467  0xc42097d680}, result  err failed to find pod details from OVS flows
W1001 22:40:49.619830  123473 node.go:368] will restart pod 'openshift-node/imagetest' due to update failure on restart: failed to find pod details from OVS flows
I1001 22:40:49.622187  123473 node.go:290] Killing pod 'openshift-node/debug' sandbox due to failed restart
I1001 22:40:49.647180  123473 cniserver.go:231] Waiting for DEL result for pod openshift-node/debug
I1001 22:40:49.647208  123473 pod.go:212] Dispatching pod network request &{DEL openshift-node debug cd5d493cf280f661a176f7449e1b4946e04bbf274e75954df755d9e959323e53 /proc/121859/ns/net 0xc42097de00}
I1001 22:40:49.653653  123473 pod.go:248] Processing pod network request &{DEL openshift-node debug cd5d493cf280f661a176f7449e1b4946e04bbf274e75954df755d9e959323e53 /proc/121859/ns/net 0xc42097de00}
oc get pods
NAME         READY     STATUS        RESTARTS   AGE
debug        1/1       Running       0          5m
imagetest    0/1       Error         0          5m

@openshift/sig-networking

@smarterclayton added the component/networking, sig/networking, and kind/bug labels Oct 1, 2017
@danwinship
Contributor

It terminates pods if (and only if) it can't re-establish networking to them. The assumption was that Kubernetes would restart the pod in that case, but I guess that doesn't work in all cases.

But it would only be unable to re-establish networking to them if something went wrong during the restart. This is basically a dup/extension of #16630.
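For context, a rough sketch of the restart path being described, in Go. None of these names come from openshift-sdn; podInfo, reattachPodNetwork, and killSandbox are stand-ins used only to illustrate the control flow.

```go
// Illustrative sketch only: podInfo, reattachPodNetwork, and killSandbox are
// stand-ins, not the real openshift-sdn types or functions.
package main

import "log"

type podInfo struct {
	Namespace, Name, SandboxID string
}

// restorePodNetworks models the behavior described above: on SDN restart,
// try to re-establish networking for every known pod, and kill the sandbox
// of any pod whose networking cannot be re-established, on the assumption
// that the kubelet will recreate it.
func restorePodNetworks(pods []podInfo) {
	for _, p := range pods {
		if err := reattachPodNetwork(p); err != nil {
			log.Printf("will restart pod '%s/%s' due to update failure on restart: %v",
				p.Namespace, p.Name, err)
			killSandbox(p)
		}
	}
}

// Stand-in helpers so the sketch compiles and runs.
func reattachPodNetwork(p podInfo) error { return nil }

func killSandbox(p podInfo) {}

func main() {
	restorePodNetworks([]podInfo{{Namespace: "openshift-node", Name: "imagetest"}})
}
```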

@smarterclayton
Contributor Author

Yeah, we should probably not be restarting restartPolicy=Never pods, because there is nothing we can do for them anymore (their networking is going to remain broken, and it's up to the container to exit on its own).
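A minimal sketch of that check, assuming the caller has the pod's API object at hand (illustrative only; shouldKillSandboxOnUpdateFailure is a hypothetical helper, not an existing openshift-sdn function):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// shouldKillSandboxOnUpdateFailure returns true only for pods the kubelet
// would recreate after their sandbox is killed. Run-once pods
// (restartPolicy: Never) are left alone: killing them cannot restore their
// networking, it only turns a possibly harmless disruption into a hard failure.
func shouldKillSandboxOnUpdateFailure(pod *corev1.Pod) bool {
	return pod.Spec.RestartPolicy != corev1.RestartPolicyNever
}

func main() {
	runOnce := &corev1.Pod{}
	runOnce.Spec.RestartPolicy = corev1.RestartPolicyNever
	fmt.Println(shouldKillSandboxOnUpdateFailure(runOnce)) // false: leave the run-once pod running
}
```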

@smarterclayton
Contributor Author

P1 because we terminate pods that might otherwise run safely to completion.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label Feb 22, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Mar 24, 2018
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close
