This repository has been archived by the owner on Jul 30, 2021. It is now read-only.

Bump k8s v1.8.1 #729

Merged: 6 commits into kubernetes-retired:master from the v1.8.0 branch, Oct 18, 2017

Conversation

@aaronlevy (Contributor)

I'll wait on bumping vendor libs to v1.8 until client-go has a more unified v1.8-targeted release.

cc @rphillips @dghubble

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 4, 2017
@aaronlevy (Contributor, author)

Also, I'm running conformance tests separately; I'll hold off on merging until those pass too.

@rphillips (Contributor)

lgtm

@dghubble (Contributor) commented Oct 4, 2017

Tested PXE-booted bare-metal setups with both flannel and Calico. Bootstrapping succeeds and the control plane seems ok (nothing too atypical in logs), along with the usual stuff (logs, port-forward, pod-to-pod, examples, etc.).

This didn't cover controller reboot survival (i.e. checkpointing), which seems to be where the tests are failing.

@dghubble (Contributor) commented Oct 4, 2017

Reboots were ok for me; maybe it's a testing artifact: "jsonpatch replace operation does not apply: doc is missing key: /spec/template/spec/nodeSelector". Shouldn't https://github.com/kubernetes-incubator/bootkube/blob/3a4626cc52baf1343bf62ad916f35aef04b2b673/e2e/checkpointer_test.go#L181 specify "node-role.kubernetes.io/node":"" since the comment says it's for a worker?
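
For context, a minimal sketch of that failure mode, assuming the evanphx/json-patch Go library (the error string matches its wording; this is not necessarily how the test constructs the patch): a JSON-Patch "replace" requires the target path to already exist, while "add" creates it.

```go
package main

import (
	"fmt"

	jsonpatch "github.com/evanphx/json-patch"
)

func main() {
	// A pod template spec with no nodeSelector set, like the DaemonSet under test.
	doc := []byte(`{"spec":{"template":{"spec":{}}}}`)

	// "replace" requires the target key to already exist, so this reproduces an
	// error like "replace operation does not apply: doc is missing key:
	// /spec/template/spec/nodeSelector".
	replace, _ := jsonpatch.DecodePatch([]byte(
		`[{"op":"replace","path":"/spec/template/spec/nodeSelector","value":{"node-role.kubernetes.io/node":""}}]`))
	if _, err := replace.Apply(doc); err != nil {
		fmt.Println("replace:", err)
	}

	// "add" creates the key when it is absent, so the same change succeeds.
	add, _ := jsonpatch.DecodePatch([]byte(
		`[{"op":"add","path":"/spec/template/spec/nodeSelector","value":{"node-role.kubernetes.io/node":""}}]`))
	out, err := add.Apply(doc)
	fmt.Println(string(out), err)
}
```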

@aaronlevy (Contributor, author)

Conformance has 2 failures I need to check out:

[1061461.011458] golang[5]: [14168.336421] golang[5]: Summarizing 2 Failures:
[1061461.011722] golang[5]: [14168.336603] golang[5]: [Fail] [k8s.io] Pods [It] should be submitted and removed [Conformance] 
[1061461.012043] golang[5]: [14168.336787] golang[5]: /go/src/k8s.io/kubernetes/_output/local/go/src/k8s.io/kubernetes/test/e2e/common/pods.go:252
[1061461.012319] golang[5]: [14168.336972] golang[5]: [Fail] [k8s.io] Pods Extended [k8s.io] Delete Grace Period [It] should be submitted and removed [Conformance] [Flaky]

@dghubble I think we might have to bump the vendoring after all (or update the tests). I'm not super clear on the interaction of using "extensions/v1beta1" in the tests vs. creating the objects with v1.8's "apps/v1beta2".

@dghubble (Contributor) commented Oct 5, 2017

Any idea if my suggestion will unblock the e2e tests? Might be worth a shot. I hadn't looked at the conformance tests yet.

@aaronlevy (Contributor, author)

I believe the intention there is to remove the nodeSelector altogether (so the daemonset ends up on all nodes); we're not using node affinity in this case. I have a feeling the actual failure is due to some versioning interplay (e.g. using ExtensionsV1beta1() when it's been switched to apps/v1beta2).
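
A hypothetical sketch of that interplay, using the client-go 5.x typed clients that pair with Kubernetes v1.8 (the namespace, object name, and helper are illustrative, not the actual test code): a helper written against one group's client may not match how the manifests now create the object.

```go
package e2e

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// fetchBothWays illustrates reading the same DaemonSet through both typed
// clients: the older path the tests use, and the group/version the v1.8
// manifests target.
func fetchBothWays(c kubernetes.Interface) error {
	// Path an older test helper might use:
	if _, err := c.ExtensionsV1beta1().DaemonSets("kube-system").Get("pod-checkpointer", metav1.GetOptions{}); err != nil {
		return err
	}
	// Group/version the v1.8 manifests create the object as:
	_, err := c.AppsV1beta2().DaemonSets("kube-system").Get("pod-checkpointer", metav1.GetOptions{})
	return err
}
```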

@aaronlevy (Contributor, author)

coreosbot run e2e

@aaronlevy (Contributor, author)

@dghubble do you know how to re-trigger the e2e-calico test?

@dghubble (Contributor) commented Oct 6, 2017

coreosbot run e2e calico

@dghubble (Contributor) commented Oct 6, 2017

Happy to change it, it's just the first phrase I defined.


@rphillips rphillips mentioned this pull request Oct 9, 2017
@dghubble (Contributor) commented Oct 9, 2017

Both failures are related to the checkpointer. Any relation to the (not-yet-included) improvements made in #733 @thorfour?

@dghubble (Contributor) commented Oct 9, 2017

@aaronlevy I see e2e and self-hosted etcd e2e are using "coreosbot run e2e", so I've changed calico-e2e to use the same. Docs are already correct.

@thorfour (Contributor) commented Oct 9, 2017

My changes wouldn't be a fix for those failures. I believe I've seen those failures before; it seems like a timing issue.

@aaronlevy (Contributor, author)

I don't think it's related to the checkpointer itself, but rather to behavior that the checkpointer tests touch. It seems like something related to removing pods has changed (similarly, I've gotten the same conformance test failures twice now, so they're not outright flakes).

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Oct 10, 2017
@dghubble (Contributor)

Looking at the tests: we reboot masters, start workers, verify "pod-checkpointer" is present on a worker (verifyPod with docker ps), start masters, and then verify "pod-checkpointer" is NOT present on a worker (the step that fails).
https://github.com/kubernetes-incubator/bootkube/blame/master/e2e/checkpointer_test.go#L277
https://github.com/kubernetes-incubator/bootkube/blame/master/e2e/checkpointer_test.go#L89

I'm guessing this is done to test the idea that once an apiserver can be contacted, the checkpointer doesn't need to be running on that worker per the doc, but I'm not 100% sure. cc @yifan-gu

@aaronlevy (Contributor, author)

Added a sleep and the tests pass. I have a feeling removing a pod might have gotten slower. One option might just be to increase the retry window on verifyPod.

@dghubble (Contributor)

Increasing the retries allowed does seem nicer than a fixed sleep.
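
A sketch of what a retry window might look like with apimachinery's wait.Poll; the interval, timeout, and the podGone helper are assumptions, not the actual verifyPod signature:

```go
package e2e

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForPodRemoval polls until the pod is gone or the timeout elapses,
// instead of sleeping a fixed amount. podGone is a hypothetical stand-in
// for the e2e verifyPod check.
func waitForPodRemoval(podGone func() (bool, error)) error {
	return wait.Poll(5*time.Second, 3*time.Minute, func() (bool, error) {
		gone, err := podGone()
		if err != nil {
			return false, nil // treat transient errors as "not yet"
		}
		return gone, nil
	})
}
```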

@@ -161,7 +169,7 @@ spec:
         - /var/lock/api-server.lock
         - /hyperkube
         - apiserver
-        - --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota
+        - --admission-control=NamespaceLifecycle,LimitRanger,ServiceAccount,PersistentVolumeLabel,DefaultStorageClass,ResourceQuota,DefaultTolerationSeconds


quick question: do we also have to carry these admission-control setting changes over into the Tectonic installer?

@aaronlevy (Contributor, author)

I cc'd you on an internal issue and will send a follow up email.

@dghubble (Contributor)

Looks like that cluster's apiserver failed to come back and subsequent tests failed. Hm, known or new?

@aaronlevy (Contributor, author)

The self-hosted etcd tests are actually still occasionally flaky, and not really maintained. I'll re-run for now (the last run with the sleep passed, and this is functionally the same change).

@dghubble (Contributor)

On bare-metal with bootkube render at this commit and --network-provider=experimental-calico, controller and full cluster reboots were tolerated as usual.

NAMESPACE     NAME                                       READY     STATUS    RESTARTS   AGE
kube-system   calico-node-c6596                          2/2       Running   0          8m
kube-system   calico-node-j2kdj                          2/2       Running   2          8m
kube-system   calico-node-w65hg                          2/2       Running   0          8m
kube-system   kube-apiserver-55ctg                       1/1       Running   1          8m
kube-system   kube-controller-manager-5dbc5d8c6b-bd2v5   1/1       Running   1          8m
kube-system   kube-controller-manager-5dbc5d8c6b-cmbh9   1/1       Running   1          8m
kube-system   kube-dns-598c789574-9vgnd                  3/3       Running   3          7m
kube-system   kube-proxy-8nwbt                           1/1       Running   1          7m
kube-system   kube-proxy-c8x54                           1/1       Running   0          7m
kube-system   kube-proxy-jgfg8                           1/1       Running   0          7m
kube-system   kube-scheduler-bdb68cdc-4tqzp              1/1       Running   1          7m
kube-system   kube-scheduler-bdb68cdc-5qhp5              1/1       Running   1          7m
kube-system   pod-checkpointer-85p29                     1/1       Running   1          7m
kube-system   pod-checkpointer-85p29-node1.example.com   1/1       Running   1          7m

rel: poseidon/matchbox#657

@dghubble (Contributor)

On CI, TestReboot failed once and TestCheckpointerUnscheduleCheckpointer failed once, on separate runs. So it's a flakiness issue rather than being broken outright.

I'll point out that the reboot sometimes fails at the testing code's call to node.Reboot, not from cluster behavior itself: reboot_test.go:30: failed to reboot node: Process exited with status 1

@dghubble (Contributor)

coreosbot run e2e

@dghubble (Contributor)

Tested on a cloud provider as well.

@dghubble dghubble self-requested a review October 13, 2017 21:36
@dghubble (Contributor)

e2e-selfetcd failed on Terraform creation of AWS resources: aws_instance.master_node.0: Error waiting for instance (i-0c6b293b70086d5fa) to become ready: Failed to reach target state. Reason: Server.InternalError: Internal error on launch

@dghubble (Contributor)

coreosbot run e2e

@aaronlevy (Contributor, author)

Any thoughts on how we might make this less flaky?

@dghubble (Contributor) commented Oct 17, 2017

  • TestReboot failures sometimes occur simply because the reboot logic literally tries to ssh once and run "reboot". No retries (see the sketch after this list).
  • The most recent nginx workload failure comes from https://github.com/coreos/ktestutil. It hasn't been touched in a while (old client-go?). It hardcodes its own retry intervals and counts, so they're unaffected by your bumps. https://github.com/coreos/ktestutil/tree/master/testworkload
  • Terraform failures: I don't have much of an answer here. I've found that AWS is by far the most unreliable Terraform platform. No idea if that's ultimately the plugin, the APIs, or their interactions. It's a lot of source code.
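
A sketch of the retry idea from the first bullet; the rebooter interface is a stand-in for the harness's node type, not the actual e2e API:

```go
package e2e

import (
	"fmt"
	"time"
)

// rebooter abstracts the harness's node.Reboot call; the real type and
// signature in the e2e code may differ.
type rebooter interface {
	Reboot() error
}

// rebootWithRetry retries the ssh-and-reboot step instead of giving up on
// the first "Process exited with status 1".
func rebootWithRetry(n rebooter, attempts int, backoff time.Duration) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = n.Reboot(); err == nil {
			return nil
		}
		time.Sleep(backoff)
	}
	return fmt.Errorf("failed to reboot node after %d attempts: %v", attempts, err)
}
```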

At a higher level and longer term, I'd like to strip out functionality we don't intend to go forward with, such as the self-hosted kubelet (#735, but not an issue here) and self-hosted etcd. Most of these issues boil down to crash loops being a normal part of self-hosted control planes, our control plane growing more complex, and there being no precise definition of when setup is complete and e2e/conformance testing should begin. It'd also be valuable to have a better way to show how long a step normally takes; right now we kind of go by instinct, and maybe these values need re-tuning.

@dghubble (Contributor)

coreosbot run e2e

@aaronlevy (Contributor, author)

I'm really hesitant to merge this, even if tests finally pass, without addressing some of the underlying issues. Otherwise every subsequent PR will have to run tests 5 times to pass (which wasn't the case prior to this PR). So that might mean we need to increase timeouts.

@dghubble (Contributor)

Agreed. At this point it's mainly useful to see where we're hitting failures. It'd be different if it were the same error every time.

@klausenbusk (Contributor)

> I'd like to strip out functionality we don't intend to go forward with such as self-hosted kubelet (#735, but not an issue here) and self-hosted etcd

Sorry for hijacking this issue, but why do you want to drop self-hosted etcd? Isn't it working? Too unstable? I can see it was just removed from the Tectonic UI.

@dghubble (Contributor) commented Oct 17, 2017

This isn't the place for that discussion; it would hijack this PR. I'll open an issue later to communicate the plan more broadly.

EDIT: See #738

@dghubble (Contributor)

Summarizing here, in the 5 builds on v1.8.1 (rather than v1.8.0):

  • e2e - 5/5 passes
  • e2e-calico - 4/5 passes
    • One failure on TestReboot due to ssh'ing once and not retrying (known test issue)
  • e2e-selfetcd - 2/5 passes
    • One failure due to Terraform AWS failing to create instance (known flaky)
    • One failure on TestSmoke waiting for nginx creation (unknown)
    • One failure on TestDeleteAPI and subsequent tests "non-checkpoint apiserver did not come back". (unknown)

@aaronlevy (Contributor, author)

All conformance tests passed once bumped to v1.8.1

* Latest license-bill-of-materials reformats list
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Oct 17, 2017
@dghubble (Contributor) commented Oct 17, 2017

Added some fixes which should address the TestReboot and TestSmoke flakes. I don't have anything for the Terraform or occasional self-hosted etcd TestDeleteAPI failures. I haven't been able to reproduce them on my clusters running v1.8.1 with self-hosted etcd.

I'll note that the flakes are in e2e-calico and e2e-selfetcd, which are experimental; we haven't seen any in e2e (standard options). Not that that justifies them.

@dghubble (Contributor)

Ok, so we've seen flaky failures in TestDeleteAPI ("non-checkpoint apiserver did not come back") on both e2e-calico and e2e-selfetcd.

@aaronlevy (Contributor, author)

Looks like all passed. Good to merge @dghubble ?

@dghubble (Contributor)

I'll keep trying to repro the flakes in experimental features.

Yes, good to merge.

@aaronlevy aaronlevy merged commit 9230a32 into kubernetes-retired:master Oct 18, 2017
@aaronlevy aaronlevy deleted the v1.8.0 branch October 18, 2017 18:27