
openshift-sdn network restart terminates run once pods immediately #16632

Closed

smarterclayton opened this issue Oct 1, 2017 · 6 comments

Labels: component/networking · kind/bug · lifecycle/rotten · priority/P1 · sig/networking

@smarterclayton
Contributor

smarterclayton commented Oct 1, 2017

When openshift-sdn restarts, the CRI net namespace restart function terminates run-once pods that may not even need networking, leading to failures.

It's not clear to me that completely terminating all run-once pods on a node when the sdn process is disrupted is correct.

I1001 22:40:49.619765  123473 pod.go:250] Processed pod network request &{UPDATE openshift-node imagetest acda4ba2cdc58950364307639a38e0724a2b57bd519a0a576fe6f766d1617467  0xc42097d680}, result  err failed to find pod details from OVS flows
I1001 22:40:49.619819  123473 pod.go:215] Returning pod network request &{UPDATE openshift-node imagetest acda4ba2cdc58950364307639a38e0724a2b57bd519a0a576fe6f766d1617467  0xc42097d680}, result  err failed to find pod details from OVS flows
W1001 22:40:49.619830  123473 node.go:368] will restart pod 'openshift-node/imagetest' due to update failure on restart: failed to find pod details from OVS flows
I1001 22:40:49.622187  123473 node.go:290] Killing pod 'openshift-node/debug' sandbox due to failed restart
I1001 22:40:49.647180  123473 cniserver.go:231] Waiting for DEL result for pod openshift-node/debug
I1001 22:40:49.647208  123473 pod.go:212] Dispatching pod network request &{DEL openshift-node debug cd5d493cf280f661a176f7449e1b4946e04bbf274e75954df755d9e959323e53 /proc/121859/ns/net 0xc42097de00}
I1001 22:40:49.653653  123473 pod.go:248] Processing pod network request &{DEL openshift-node debug cd5d493cf280f661a176f7449e1b4946e04bbf274e75954df755d9e959323e53 /proc/121859/ns/net 0xc42097de00}
oc get pods
NAME         READY     STATUS        RESTARTS   AGE
debug        1/1       Running       0          5m
imagetest    0/1       Error         0          5m

@openshift/sig-networking

@smarterclayton added the component/networking, sig/networking, and kind/bug labels Oct 1, 2017
@danwinship
Contributor

It terminates pods if (and only if) it can't re-establish networking to them. The assumption was that Kubernetes would restart the pod in that case, but I guess that doesn't work in all cases.

But it would only be unable to re-establish networking to them if something went wrong during the restart. This is basically a dup/extension of #16630.
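For context, a rough sketch of the restart path being described, in Go. None of these names come from openshift-sdn; podInfo, reattachPodNetwork, and killSandbox are stand-ins used only to illustrate the control flow.

```go
// Illustrative sketch only: podInfo, reattachPodNetwork, and killSandbox are
// stand-ins, not the real openshift-sdn types or functions.
package main

import "log"

type podInfo struct {
	Namespace, Name, SandboxID string
}

// restorePodNetworks models the behavior described above: on SDN restart,
// try to re-establish networking for every known pod, and kill the sandbox
// of any pod whose networking cannot be re-established, on the assumption
// that the kubelet will recreate it.
func restorePodNetworks(pods []podInfo) {
	for _, p := range pods {
		if err := reattachPodNetwork(p); err != nil {
			log.Printf("will restart pod '%s/%s' due to update failure on restart: %v",
				p.Namespace, p.Name, err)
			killSandbox(p)
		}
	}
}

// Stand-in helpers so the sketch compiles and runs.
func reattachPodNetwork(p podInfo) error { return nil }

func killSandbox(p podInfo) {}

func main() {
	restorePodNetworks([]podInfo{{Namespace: "openshift-node", Name: "imagetest"}})
}
```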

@smarterclayton
Contributor Author

Yeah, we should probably not be restarting restartPolicy=Never pods, because there is nothing we can do for them anymore (their networking is going to remain broken, and it's up to the container to exit on its own).
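A minimal sketch of that check, assuming the caller has the pod's API object at hand (illustrative only; shouldKillSandboxOnUpdateFailure is a hypothetical helper, not an existing openshift-sdn function):

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// shouldKillSandboxOnUpdateFailure returns true only for pods the kubelet
// would recreate after their sandbox is killed. Run-once pods
// (restartPolicy: Never) are left alone: killing them cannot restore their
// networking, it only turns a possibly harmless disruption into a hard failure.
func shouldKillSandboxOnUpdateFailure(pod *corev1.Pod) bool {
	return pod.Spec.RestartPolicy != corev1.RestartPolicyNever
}

func main() {
	runOnce := &corev1.Pod{}
	runOnce.Spec.RestartPolicy = corev1.RestartPolicyNever
	fmt.Println(shouldKillSandboxOnUpdateFailure(runOnce)) // false: leave the run-once pod running
}
```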

@smarterclayton
Contributor Author

P1 because we terminate pods that might otherwise run safely to completion.

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot added the lifecycle/stale label Feb 22, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Mar 24, 2018
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close
