Timeout on instances.NodeAddresses cloud provider request #62543

ingvagabund · 2018-04-13T14:25:10Z

What this PR does / why we need it:

Adds a timeout to the instances.NodeAddresses cloud provider request, for cases where the cloud provider does not respond before the node gets evicted.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Release note:

Stop the kubelet's cloud provider integration from potentially wedging the kubelet sync loop.

ingvagabund · 2018-04-13T14:30:45Z

@sjenning

derekwaynecarr · 2018-04-18T18:00:49Z

a call to the cloud provider wedging the kubelet is a problem. i wish the cloud provider apis had a defined timeout, but i recognize that is harder to push out.

derekwaynecarr · 2018-04-18T18:00:58Z

/kind bug

derekwaynecarr

can you clarify on how 3m default was selected?

derekwaynecarr · 2018-04-18T18:04:16Z

pkg/kubelet/kubelet.go

+		klet.cloudproviderRequestParallelism = make(chan int, 1)
+		klet.cloudproviderRequestSync = make(chan int)
+		// TODO(jchaloup): Make it configurable via --cloud-provider-request-timeout
+		klet.cloudproviderRequestTimeout = 3 * time.Minute


how did you settle on 3m?

i feel like this should be less than the heartbeat status interval

I agree, either 10s (status frequency) or 30s (10s less than the 40s timeout before the node is marked NotReady). I favor 10s.

There is also the fundamental issue of whether we need to do this call on every status update. Seems like this information (hostname, internal ip, external ip) is pretty static.

how did you settle on 3m?

for testing purposes. I plan to make the timeout configurable. Or does it make more sense to hardcode it? 10s seems good.

sjenning · 2018-04-19T16:14:28Z

pkg/kubelet/kubelet_node_status.go

+		select {
+		case <-kl.cloudproviderRequestSync:
+		case <-time.After(kl.cloudproviderRequestTimeout):
+			err = fmt.Errorf("Timeout after %v", kl.cloudproviderRequestTimeout)


Might put this at V(2) and change to "timeout after %v trying to get instance information from cloud provider"

glog.V(2).Infof("timeout after %v trying to get instance information from cloud provider", kl.cloudproviderRequestTimeout)
return nil

?

actually nevermind this whole thing. i didn't see the block after this and that we were just setting err here, not returning it.

ingvagabund · 2018-04-19T16:39:20Z

@derekwaynecarr @sjenning changed the timeout to 10 seconds.

sjenning · 2018-04-19T19:10:07Z

@ingvagabund you might need to rebase this to resolve the test failures. The tests don't seem to be broken across all PRs, but they are repeatably broken for this PR.

ingvagabund · 2018-04-20T09:59:29Z

Just rebasing. I will check the failed tests afterwards.

ingvagabund · 2018-04-20T10:59:45Z

/test pull-kubernetes-bazel-test

ingvagabund · 2018-04-21T15:05:40Z

No idea why the bazel fails:

W0420 11:18:29.104] TIMEOUT: //pkg/kubelet:go_default_test (Summary)
W0420 11:18:29.104]       /bazel-scratch/.cache/bazel/_bazel_root/e9f728bbd90b3fba632eb31b20e1dacd/execroot/__main__/bazel-out/k8-fastbuild/testlogs/pkg/kubelet/go_default_test/test_attempts/attempt_1.log
W0420 11:18:29.104]       /bazel-scratch/.cache/bazel/_bazel_root/e9f728bbd90b3fba632eb31b20e1dacd/execroot/__main__/bazel-out/k8-fastbuild/testlogs/pkg/kubelet/go_default_test/test_attempts/attempt_2.log
W0420 11:18:29.104]       /bazel-scratch/.cache/bazel/_bazel_root/e9f728bbd90b3fba632eb31b20e1dacd/execroot/__main__/bazel-out/k8-fastbuild/testlogs/pkg/kubelet/go_default_test/test.log
W0420 11:18:29.104] INFO: From Testing //pkg/kubelet:go_default_test:
W0420 11:18:29.132] INFO: Elapsed time: 987.597s, Critical Path: 908.95s
W0420 11:18:29.133] INFO: Build completed, 1 test FAILED, 9276 total actions
I0420 11:18:29.233] ==================== Test output for //pkg/kubelet:go_default_test:

Any hints what to look for?

ingvagabund · 2018-04-23T11:10:36Z

Reproducible locally by running bazel test --config=unit --build_tag_filters=-e2e,-integration --test_tag_filters=-e2e,-integration --flaky_test_attempts=3 //pkg/kubelet:go_default_test

ingvagabund · 2018-04-23T11:23:44Z

kubelet_node_status_test.go:TestNodeStatusWithCloudProviderNodeIP is failing

ingvagabund · 2018-04-23T12:01:13Z

/test pull-kubernetes-integration

ingvagabund · 2018-04-23T12:58:46Z

/test pull-kubernetes-local-e2e-containerized

ingvagabund · 2018-04-23T13:02:31Z

I0419 17:05:31.776] Unable to find image 'k8s.gcr.io/kubelet:latest' locally

Most likely the pull-kubernetes-local-e2e-containerized job fails due to that.

ingvagabund · 2018-04-23T13:06:54Z

@sjenning @derekwaynecarr PTAL

dims · 2018-04-23T13:07:17Z

@ingvagabund you can safely ignore the pull-kubernetes-local-e2e-containerized job. it's not fully baked yet. (it's not a required job)

k8s-ci-robot · 2018-04-23T13:18:48Z

@ingvagabund: The following test failed, say /retest to rerun them all:

Test name	Commit	Details	Rerun command
pull-kubernetes-local-e2e-containerized	`61efc29`	link	`/test pull-kubernetes-local-e2e-containerized`

ingvagabund · 2018-04-23T13:33:16Z

@dims yeah, I checked the TestGrid and other PRs and it seems to be so true :) Thanks for verifying that.

derekwaynecarr · 2018-04-23T15:03:47Z

/lgtm
/approve

k8s-ci-robot · 2018-04-23T15:03:55Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: derekwaynecarr, ingvagabund

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~pkg/kubelet/OWNERS~~ [derekwaynecarr]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-github-robot · 2018-04-23T15:23:37Z

/test all [submit-queue is verifying that this PR is safe to merge]

k8s-github-robot · 2018-04-23T16:12:41Z

Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here.

On kubelet start-up, a set of functions for setting the node status is generated. One of those functions propagates the node addresses into the `Node` object the kubelet is responsible for (`.status.addresses`). The kube-apiserver uses these addresses to talk to the actual node. To identify the IP address of the node, the kubelet communicates with the cloud provider. kubernetes/kubernetes#62543 introduced a timeout of 10s when trying to connect to the cloud. If the IP cannot be determined within 10s, the `Node` object does not report an `InternalIP` address. Consequently, the kube-apiserver will never be able to talk to that node; in particular, VPN won't work if the vpn-shoot pod is scheduled on it. Once the connection has failed, it is never retried, and only a kubelet process restart can trigger it again. Hence, our kubelet monitoring script will now restart the kubelet when it cannot find an `InternalIP` or an `ExternalIP` address on the `Node` object. closes #283
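
The check described in this commit message can be sketched roughly as follows. This is an illustrative assumption, not the actual monitoring script: `NodeAddress` is declared locally to mirror the shape of an entry in `.status.addresses`, and `hasNodeIP` is a hypothetical helper name.

```go
package main

import "fmt"

// NodeAddress mirrors one entry of a Node's .status.addresses list.
// It is declared locally so the sketch stays self-contained.
type NodeAddress struct {
	Type    string // e.g. "Hostname", "InternalIP", "ExternalIP"
	Address string
}

// hasNodeIP reports whether the node has at least one non-empty
// InternalIP or ExternalIP address. Hypothetical helper.
func hasNodeIP(addrs []NodeAddress) bool {
	for _, a := range addrs {
		if (a.Type == "InternalIP" || a.Type == "ExternalIP") && a.Address != "" {
			return true
		}
	}
	return false
}

func main() {
	// A node whose cloud provider lookup timed out may report only a hostname.
	broken := []NodeAddress{{Type: "Hostname", Address: "node-1"}}
	healthy := []NodeAddress{
		{Type: "Hostname", Address: "node-1"},
		{Type: "InternalIP", Address: "10.0.0.5"},
	}
	fmt.Println("restart needed:", !hasNodeIP(broken))
	fmt.Println("restart needed:", !hasNodeIP(healthy))
}
```

A monitoring loop would restart the kubelet process whenever the check reports that no IP address is present.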

k8s-ci-robot requested review from dims and pmorie April 13, 2018 14:25

ingvagabund force-pushed the timeout-on-cloud-provider-request branch 2 times, most recently from 0b6b6ff to 55e0b31 Compare April 13, 2018 14:38

k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 18, 2018

derekwaynecarr suggested changes Apr 18, 2018

sjenning reviewed Apr 19, 2018

ingvagabund force-pushed the timeout-on-cloud-provider-request branch from 55e0b31 to 478e3b8 Compare April 19, 2018 16:38

ingvagabund changed the title ~~WIP: Timeout on instances.NodeAddresses cloud provider request~~ Timeout on instances.NodeAddresses cloud provider request Apr 19, 2018

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 19, 2018

ingvagabund force-pushed the timeout-on-cloud-provider-request branch from 478e3b8 to d34b50d Compare April 20, 2018 09:58

Timeout on instances.NodeAddresses cloud provider request

61efc29

ingvagabund force-pushed the timeout-on-cloud-provider-request branch from d34b50d to 61efc29 Compare April 23, 2018 11:30

k8s-ci-robot assigned derekwaynecarr Apr 23, 2018

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Apr 23, 2018

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 23, 2018

k8s-github-robot merged commit 5b77996 into kubernetes:master Apr 23, 2018

ingvagabund deleted the timeout-on-cloud-provider-request branch April 24, 2018 10:30

ingvagabund mentioned this pull request May 16, 2018

UPSTREAM: 62543: Timeout on instances.NodeAddresses cloud provider request openshift/origin#19733

Merged

sjenning mentioned this pull request Jun 26, 2018

Store the latest cloud provider node addresses #65226

Merged
