Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recreate pod sandbox when the sandbox does not have an IP address. #48970

Merged

Conversation

caseydavenport
Copy link
Member

What this PR does / why we need it:

Attempts to fix a bug where Pods do not receive networking when the kubelet restarts during pod creation.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged):

fixes # #48510

Release note:

NONE

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jul 15, 2017
@caseydavenport
Copy link
Member Author

/sig network
/release-note-none

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. sig/network Categorizes an issue or PR as relevant to SIG Network. labels Jul 15, 2017
@k8s-github-robot k8s-github-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jul 15, 2017
@caseydavenport
Copy link
Member Author

I've run a few tests with this, but haven't yet been able to reproduce the issue this was intended to fix (which was occurring pretty reliably on v1.7.0).

Still want to do some more runs to see if I can repro.

@caseydavenport
Copy link
Member Author

/test pull-kubernetes-kubemark-e2e-gce
/test pull-kubernetes-unit

@caseydavenport
Copy link
Member Author

Huh, kubemark tests seem to be failing due to OAuth issues:

I0717 19:00:31.142] Your "OAuth 2.0 Service Account" credentials are invalid. Please run
I0717 19:00:31.142]   $ gcloud auth login
I0717 19:00:31.142] Type position out of range

@caseydavenport
Copy link
Member Author

/retest

@caseydavenport
Copy link
Member Author

@Random-Liu any chance you could take a look and see if you think this approach makes any sense?

@caseydavenport
Copy link
Member Author

@kubernetes/sig-network-pr-reviews ^

@caseydavenport
Copy link
Member Author

/retest

@caseydavenport
Copy link
Member Author

/test pull-kubernetes-bazel-test

@dcbw
Copy link
Member

dcbw commented Sep 6, 2017

@caseydavenport I want to get to a place where the runtime tracks whether the container was fully set up or not (including networking), and I guess they'd have to checkpoint that state somewhere too and validate on GetPodStatus(). I feel like your solution here is the best thing we can do now, but perhaps could we add a FIXME that it should be pushed to the runtimes to correctly report whether the pod was fully set up or not?

@dcbw
Copy link
Member

dcbw commented Sep 6, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 6, 2017
@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 7, 2017
@dcbw
Copy link
Member

dcbw commented Sep 7, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 7, 2017
@caseydavenport
Copy link
Member Author

/retest

@feiskyer
Copy link
Member

The fix LGTM.

@caseydavenport Could you fix the unit tests?

@k8s-ci-robot k8s-ci-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Sep 13, 2017
@k8s-github-robot k8s-github-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 13, 2017
@caseydavenport
Copy link
Member Author

/retest

@caseydavenport
Copy link
Member Author

I believe I've fixed the UTs (plus added a new one for this fix).

The last round of failures appeared to be unrelated at a glance.

@feiskyer
Copy link
Member

/retest

@dcbw
Copy link
Member

dcbw commented Sep 15, 2017

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 15, 2017
@caseydavenport
Copy link
Member Author

@dcbw given #48510 is in the 1.8 milestone can we get this PR added as well?

@feiskyer
Copy link
Member

given #48510 is in the 1.8 milestone can we get this PR added as well?

+1. This is fixing a v1.8 issues.

ping @Random-Liu for approval.

@dims
Copy link
Member

dims commented Sep 18, 2017

/kind bug

@jdumars @kubernetes/sig-release-members Can you please add the milestone 1.8 for this bug?

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Sep 18, 2017
@@ -396,6 +396,12 @@ func (m *kubeGenericRuntimeManager) podSandboxChanged(pod *v1.Pod, podStatus *ku
return true, sandboxStatus.Metadata.Attempt + 1, ""
}

// Needs to create a new sandbox when the sandbox does not have an IP address.
if !kubecontainer.IsHostNetworkPod(pod) && sandboxStatus.Network.Ip == "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we know why IP address is not allocated for the pod? My concern is that we paper-over the real issue here. But I guess this is best solution we could come up so far, I am ok with it as a temporary workaround.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We know of at least one scenario, which this PR is meant to fix - it's a condition in which the kubelet gets restarted during CNI execution and so the sandbox exists but doesn't yet have an IP address.

@dchen1107
Copy link
Member

/lgtm

@dchen1107 dchen1107 added this to the v1.8 milestone Sep 18, 2017
@k8s-github-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: caseydavenport, dcbw, dchen1107

Associated issue: 48510

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these OWNERS Files:

You can indicate your approval by writing /approve in a comment
You can cancel your approval by writing /approve cancel in a comment

@k8s-github-robot k8s-github-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 18, 2017
@k8s-github-robot
Copy link

/test all [submit-queue is verifying that this PR is safe to merge]

@fejta-bot
Copy link

/retest
This bot automatically retries jobs that failed/flaked on approved PRs (send feedback to @fejta).

Review the full test history for this PR.

@k8s-github-robot
Copy link

/test all [submit-queue is verifying that this PR is safe to merge]

@k8s-github-robot
Copy link

Automatic merge from submit-queue (batch tested with PRs 48970, 52497, 51367, 52549, 52541). If you want to cherry-pick this change to another branch, please follow the instructions here..

@k8s-github-robot k8s-github-robot merged commit f80999f into kubernetes:master Sep 19, 2017
@caseydavenport caseydavenport deleted the fix-kubelet-restart branch September 19, 2017 15:47
openshift-merge-robot added a commit to openshift/origin that referenced this pull request Oct 17, 2017
Automatic merge from submit-queue (batch tested with PRs 16889, 16865).

UPSTREAM: 53857: kubelet sync pod throws more detailed events

Also includes the following upstream dependant PRs:

UPSTREAM: 50350: Wait for container cleanup before deletion
UPSTREAM: 48970: Recreate pod sandbox when the sandbox does not have an IP address.
UPSTREAM: 48589: When faild create pod sandbox record event.
UPSTREAM: 48584: Move event type
UPSTREAM: 47599: Rerun init containers when the pod needs to be restarted

xrefs:
kubernetes/kubernetes#53857
kubernetes/kubernetes#50350
kubernetes/kubernetes#48970
kubernetes/kubernetes#48589
kubernetes/kubernetes#48584
kubernetes/kubernetes#47599
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. sig/network Categorizes an issue or PR as relevant to SIG Network. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants