This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

Bump k8s v1.8.1 #729

Merged: 6 commits into kubernetes-retired:master from the v1.8.0 branch, Oct 18, 2017

Conversation

@aaronlevy (Contributor)

I'll wait on bumping vendor libs to v1.8 until client-go has a more unified v1.8-targeted release.

cc @rphillips @dghubble

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 4, 2017
@aaronlevy (Contributor, author)

Also, I'm running conformance tests separately; I'll hold off on merging until those pass too.

@rphillips (Contributor)

lgtm

@dghubble (Contributor) commented Oct 4, 2017

Tested PXE-booted bare-metal setups with both flannel and Calico. Bootstrapping succeeds and the control plane seems ok (nothing too atypical in logs), along with the usual stuff (logs, port-forward, pod-to-pod, examples, etc.).

This didn't cover controller reboot survival (i.e. checkpointing), which seems to be where the tests are failing.

@dghubble (Contributor) commented Oct 4, 2017

Reboots were ok for me; maybe it's a testing artifact: "jsonpatch replace operation does not apply: doc is missing key: /spec/template/spec/nodeSelector". Shouldn't https://github.com/kubernetes-incubator/bootkube/blob/3a4626cc52baf1343bf62ad916f35aef04b2b673/e2e/checkpointer_test.go#L181 specify "node-role.kubernetes.io/node":"" since the comment says it's for a worker?
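
For context, a minimal sketch of that failure mode, assuming the evanphx/json-patch Go library (the error string matches its wording; this is not necessarily how the test constructs the patch): a JSON-Patch "replace" requires the target path to already exist, while "add" creates it.

```go
package main

import (
	"fmt"

	jsonpatch "github.com/evanphx/json-patch"
)

func main() {
	// A pod template spec with no nodeSelector set, like the DaemonSet under test.
	doc := []byte(`{"spec":{"template":{"spec":{}}}}`)

	// "replace" requires the target key to already exist, so this reproduces an
	// error like "replace operation does not apply: doc is missing key:
	// /spec/template/spec/nodeSelector".
	replace, _ := jsonpatch.DecodePatch([]byte(
		`[{"op":"replace","path":"/spec/template/spec/nodeSelector","value":{"node-role.kubernetes.io/node":""}}]`))
	if _, err := replace.Apply(doc); err != nil {
		fmt.Println("replace:", err)
	}

	// "add" creates the key when it is absent, so the same change succeeds.
	add, _ := jsonpatch.DecodePatch([]byte(
		`[{"op":"add","path":"/spec/template/spec/nodeSelector","value":{"node-role.kubernetes.io/node":""}}]`))
	out, err := add.Apply(doc)
	fmt.Println(string(out), err)
}
```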

@aaronlevy (Contributor, author)

Conformance has 2 failures I need to check out:

[1061461.011458] golang[5]: [14168.336421] golang[5]: Summarizing 2 Failures:
[1061461.011722] golang[5]: [14168.336603] golang[5]: [Fail] [k8s.io] Pods [It] should be submitted and removed [Conformance] 
[1061461.012043] golang[5]: [14168.336787] golang[5]: /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/common/pods.go:252
[1061461.012319] golang[5]: [14168.336972] golang[5]: [Fail] [k8s.io] Pods Extended [k8s.io] Delete Grace Period [It] should be submitted and removed [Conformance] [Flaky]

@dghubble I think we might have to bump the vendoring after all (or update the tests). I'm not super clear on the interaction of using "extensions/v1beta1" in the tests vs. creating the objects with v1.8's "apps/v1beta2".

@dghubble (Contributor) commented Oct 5, 2017

Any idea if my suggestion will unblock the e2e tests? Might be worth a shot. I hadn't looked at the conformance tests yet.

@aaronlevy (Contributor, author)

I believe the intention there is to remove the nodeSelector altogether (so the daemonset ends up on all nodes); we're not using node affinity in this case. I have a feeling the actual failure is due to some versioning interplay (e.g. using ExtensionsV1beta1() when it's been switched to apps/v1beta2).
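
A hypothetical sketch of that interplay, using the client-go 5.x typed clients that pair with Kubernetes v1.8 (the namespace, object name, and helper are illustrative, not the actual test code): a helper written against one group's client may not match how the manifests now create the object.

```go
package e2e

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// fetchBothWays illustrates reading the same DaemonSet through both typed
// clients: the older path the tests use, and the group/version the v1.8
// manifests target.
func fetchBothWays(c kubernetes.Interface) error {
	// Path an older test helper might use:
	if _, err := c.ExtensionsV1beta1().DaemonSets("kube-system").Get("pod-checkpointer", metav1.GetOptions{}); err != nil {
		return err
	}
	// Group/version the v1.8 manifests create the object as:
	_, err := c.AppsV1beta2().DaemonSets("kube-system").Get("pod-checkpointer", metav1.GetOptions{})
	return err
}
```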

@aaronlevy (Contributor, author)

coreosbot run e2e

@aaronlevy (Contributor, author)

@dghubble do you know how to re-trigger the e2e-calico test?

@dghubble (Contributor) commented Oct 6, 2017

coreosbot run e2e calico

@dghubble (Contributor) commented Oct 6, 2017

Happy to change it, it's just the first phrase I defined.


@rphillips rphillips mentioned this pull request Oct 9, 2017
@dghubble (Contributor) commented Oct 9, 2017

Both failures are related to the checkpointer. Any relation to the (not-yet-included) improvements made in #733 @thorfour?

@dghubble (Contributor) commented Oct 9, 2017

@aaronlevy I see e2e and self-hosted etcd e2e are using "coreosbot run e2e", so I've changed calico-e2e to use the same. Docs are already correct.

@thorfour (Contributor) commented Oct 9, 2017

My changes wouldn't be a fix for those failures. I believe I've seen those failures before; it seems like a timing issue.

@aaronlevy (Contributor, author)

I don't think it's related to the checkpointer itself, but rather to behavior that the checkpointer tests touch. It seems like something related to removing pods has changed (similarly, I've gotten the same conformance test failures twice now, so they're not outright flakes).

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 10, 2017
@dghubble (Contributor)

Looking at the tests: we reboot masters, start workers, verify "pod-checkpointer" is present on a worker (verifyPod with docker ps), start masters, and then verify "pod-checkpointer" is NOT present on a worker (the step that fails).
https://github.com/kubernetes-incubator/bootkube/blame/master/e2e/checkpointer_test.go#L277
https://github.com/kubernetes-incubator/bootkube/blame/master/e2e/checkpointer_test.go#L89

I'm guessing this is done to test the idea that once an apiserver can be contacted, the checkpointer doesn't need to be running on that worker per the doc, but I'm not 100% sure. cc @yifan-gu

@aaronlevy (Contributor, author)

Added a sleep and the tests pass. I have a feeling removing a pod might have gotten slower. One option might just be to increase the retry window on verifyPod.

@dghubble (Contributor)

Increasing the retries allowed does seem nicer than a fixed sleep.
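
A sketch of what a retry window might look like with apimachinery's wait.Poll; the interval, timeout, and the podGone helper are assumptions, not the actual verifyPod signature:

```go
package e2e

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForPodRemoval polls until the pod is gone or the timeout elapses,
// instead of sleeping a fixed amount. podGone is a hypothetical stand-in
// for the e2e verifyPod check.
func waitForPodRemoval(podGone func() (bool, error)) error {
	return wait.Poll(5*time.Second, 3*time.Minute, func() (bool, error) {
		gone, err := podGone()
		if err != nil {
			return false, nil // treat transient errors as "not yet"
		}
		return gone, nil
	})
}
```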

@@ -161,7 +169,7 @@ spec:
         - /var/lock/api-server.lock
         - /hyperkube
         - apiserver
-        - --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota
+        - --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,PersistentVolumeLabel,DefaultStorageClass,ResourceQuota,DefaultTolerationSeconds


quick question: do we also have to carry these admission-control setting changes over into the Tectonic installer?

@aaronlevy (Contributor, author)

I cc'd you on an internal issue and will send a follow up email.

@dghubble (Contributor)

Looks like that cluster's apiserver failed to come back and subsequent tests failed. Hm, known or new?

@aaronlevy (Contributor, author)

The self-hosted etcd tests are actually still occasionally flaky, and not really maintained. I'll re-run for now (the last run with the sleep passed, and this is functionally the same change).

@dghubble (Contributor)

On bare-metal with bootkube render at this commit and --network-provider=experimental-calico, controller and full cluster reboots were tolerated as usual.

NAMESPACE     NAME                                       READY     STATUS    RESTARTS   AGE
kube-system   calico-node-c6596                          2/2       Running   0          8m
kube-system   calico-node-j2kdj                          2/2       Running   2          8m
kube-system   calico-node-w65hg                          2/2       Running   0          8m
kube-system   kube-apiserver-55ctg                       1/1       Running   1          8m
kube-system   kube-controller-manager-5dbc5d8c6b-bd2v5   1/1       Running   1          8m
kube-system   kube-controller-manager-5dbc5d8c6b-cmbh9   1/1       Running   1          8m
kube-system   kube-dns-598c789574-9vgnd                  3/3       Running   3          7m
kube-system   kube-proxy-8nwbt                           1/1       Running   1          7m
kube-system   kube-proxy-c8x54                           1/1       Running   0          7m
kube-system   kube-proxy-jgfg8                           1/1       Running   0          7m
kube-system   kube-scheduler-bdb68cdc-4tqzp              1/1       Running   1          7m
kube-system   kube-scheduler-bdb68cdc-5qhp5              1/1       Running   1          7m
kube-system   pod-checkpointer-85p29                     1/1       Running   1          7m
kube-system   pod-checkpointer-85p29-node1.example.com   1/1       Running   1          7m

rel: poseidon/matchbox#657

@dghubble (Contributor)

On CI, TestReboot failed once and TestCheckpointerUnscheduleCheckpointer failed once, on separate runs. So it's a flakiness issue rather than being broken outright.

I'll point out that the reboot sometimes fails at the testing code's call to node.Reboot, not from cluster behavior itself: reboot_test.go:30: failed to reboot node: Process exited with status 1

@dghubble (Contributor)

coreosbot run e2e

@dghubble (Contributor)

Tested on a cloud provider as well.

@dghubble dghubble self-requested a review October 13, 2017 21:36
@dghubble (Contributor)

e2e-selfetcd failed on Terraform creation of AWS resources: aws_instance.master_node.0: Error waiting for instance (i-0c6b293b70086d5fa) to become ready: Failed to reach target state. Reason: Server.InternalError: Internal error on launch

@dghubble (Contributor)

coreosbot run e2e

@aaronlevy (Contributor, author)

Any thoughts on how we might make this less flaky?

@dghubble (Contributor) commented Oct 17, 2017

  • TestReboot failures sometimes occur simply because the reboot logic literally tries to ssh once and run "reboot". No retries (see the sketch after this list).
  • The most recent nginx workload failure comes from https://github.com/coreos/ktestutil. It hasn't been touched in a while (old client-go?). It hardcodes its own retry intervals and counts, so they're unaffected by your bumps. https://github.com/coreos/ktestutil/tree/master/testworkload
  • Terraform failures: I don't have much of an answer here. I've found that AWS is by far the most unreliable Terraform platform. No idea if that's ultimately the plugin, the APIs, or their interactions. It's a lot of source code.
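
A sketch of the retry idea from the first bullet; the rebooter interface is a stand-in for the harness's node type, not the actual e2e API:

```go
package e2e

import (
	"fmt"
	"time"
)

// rebooter abstracts the harness's node.Reboot call; the real type and
// signature in the e2e code may differ.
type rebooter interface {
	Reboot() error
}

// rebootWithRetry retries the ssh-and-reboot step instead of giving up on
// the first "Process exited with status 1".
func rebootWithRetry(n rebooter, attempts int, backoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = n.Reboot(); err == nil {
			return nil
		}
		time.Sleep(backoff)
	}
	return fmt.Errorf("failed to reboot node after %d attempts: %v", attempts, err)
}
```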

At a higher level and longer term, I'd like to strip out functionality we don't intend to go forward with, such as the self-hosted kubelet (#735, but not an issue here) and self-hosted etcd. Most of these issues boil down to crash loops being a normal part of self-hosted control planes, our control plane growing more complex, and there being no precise definition of when setup is complete and e2e/conformance testing should begin. It'd also be valuable to have a better way to show how long a step normally takes; right now we kind of go by instinct, and maybe these values need re-tuning.

@dghubble (Contributor)

coreosbot run e2e

@aaronlevy (Contributor, author)

I'm really hesitant to merge this, even if tests finally pass, without addressing some of the underlying issues. Otherwise every subsequent PR will have to run tests 5 times to pass (which wasn't the case prior to this PR). So that might mean we need to increase timeouts.

@dghubble (Contributor)

Agreed. At this point it's mainly useful to see where we're hitting failures. It'd be different if it were the same error every time.

@klausenbusk (Contributor)

> I'd like to strip out functionality we don't intend to go forward with such as self-hosted kubelet (#735, but not an issue here) and self-hosted etcd

Sorry for hijacking this issue, but why do you want to drop self-hosted etcd? Isn't it working? Too unstable? I can see it was just removed from the Tectonic UI.

@dghubble (Contributor) commented Oct 17, 2017

This isn't the place for that discussion; it would hijack this PR. I'll open an issue later to communicate the plan more broadly.

EDIT: See #738

@dghubble (Contributor)

Summarizing here, in the 5 builds on v1.8.1 (rather than v1.8.0):

  • e2e - 5/5 passes
  • e2e-calico - 4/5 passes
    • One failure on TestReboot due to ssh'ing once and not retrying (known test issue)
  • e2e-selfetcd - 2/5 passes
    • One failure due to Terraform AWS failing to create instance (known flaky)
    • One failure on TestSmoke waiting for nginx creation (unknown)
    • One failure on TestDeleteAPI and subsequent tests "non-checkpoint apiserver did not come back". (unknown)

@aaronlevy (Contributor, author)

All conformance tests passed once bumped to v1.8.1

* Latest license-bill-of-materials reformats list
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 17, 2017
@dghubble (Contributor) commented Oct 17, 2017

Added some fixes which should address the TestReboot and TestSmoke flakes. I don't have anything for the Terraform or occasional self-hosted etcd TestDeleteAPI failures. I haven't been able to reproduce them on my clusters running v1.8.1 with self-hosted etcd.

I'll note that the flakes are in e2e-calico and e2e-selfetcd, which are experimental; we haven't seen any in e2e (standard options). Not that that justifies them.

@dghubble (Contributor)

Ok, so we've seen flaky failures in TestDeleteAPI ("non-checkpoint apiserver did not come back") on both e2e-calico and e2e-selfetcd.

@aaronlevy (Contributor, author)

Looks like all passed. Good to merge @dghubble ?

@dghubble (Contributor)

I'll keep trying to repro the flakes in experimental features.

Yes, good to merge.

@aaronlevy aaronlevy merged commit 9230a32 into kubernetes-retired:master Oct 18, 2017
@aaronlevy aaronlevy deleted the v1.8.0 branch October 18, 2017 18:27