k8sutil: add recovery container instead of overriding #1853

hongchaodeng · 2018-01-11T20:33:39Z

Also fix the ordering to apply pod policy on recovery containers too
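
For readers skimming the thread, here is a minimal sketch of what "add instead of overriding" means for a pod's init containers. The `Container` and `Pod` types and the container names are hypothetical stand-ins, not the real Kubernetes or etcd-operator types, and the helpers only illustrate the idea rather than reproduce the actual change.

```go
package main

import "fmt"

// Container and Pod are simplified, hypothetical stand-ins for the
// Kubernetes types; they carry only what this sketch needs.
type Container struct{ Name string }

type Pod struct{ InitContainers []Container }

// overrideRecovery replaces whatever init containers the pod already had.
func overrideRecovery(p *Pod, recovery []Container) {
	p.InitContainers = recovery
}

// addRecovery appends the recovery containers and keeps the existing ones.
func addRecovery(p *Pod, recovery []Container) {
	p.InitContainers = append(p.InitContainers, recovery...)
}

// names lists the init container names in order.
func names(p *Pod) []string {
	out := make([]string, 0, len(p.InitContainers))
	for _, c := range p.InitContainers {
		out = append(out, c.Name)
	}
	return out
}

func main() {
	recovery := []Container{{Name: "fetch-backup"}, {Name: "restore-datadir"}}

	a := &Pod{InitContainers: []Container{{Name: "existing-init"}}}
	overrideRecovery(a, recovery)
	fmt.Println(names(a)) // [fetch-backup restore-datadir]

	b := &Pod{InitContainers: []Container{{Name: "existing-init"}}}
	addRecovery(b, recovery)
	fmt.Println(names(b)) // [existing-init fetch-backup restore-datadir]
}
```

With the override variant, any init container that was set up earlier is silently lost; appending preserves it and runs the recovery containers afterwards.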

hongchaodeng · 2018-01-12T01:48:34Z

Stress testing proves that this will fix the bug: https://jenkins-etcd.prod.coreos.systems/job/etcd-operator-flaketest/buildTimeTrend

hongchaodeng · 2018-01-12T01:49:01Z

@etcd-bot retest this please

fanminshi · 2018-01-12T17:29:15Z

pkg/util/k8sutil/k8sutil.go

@@ -221,14 +222,17 @@ func addOwnerRefToObject(o metav1.Object, r metav1.OwnerReference) {
 // It's special that it has new token, and might need recovery init containers
 func NewSeedMemberPod(clusterName string, ms etcdutil.MemberSet, m *etcdutil.Member, cs api.ClusterSpec, owner metav1.OwnerReference, backupURL *url.URL) *v1.Pod {
 	token := uuid.New()
-	pod := NewEtcdPod(m, ms.PeerURLPairs(), clusterName, "new", token, cs, owner)
+	pod := newEtcdPod(m, ms.PeerURLPairs(), clusterName, "new", token, cs)


Why change NewEtcdPod to newEtcdPod?

Never mind, I see there is an ordering change.

fanminshi · 2018-01-12T18:21:37Z

Tested this PR. I still see the same issue.

$ kubectl -n e2e-8 logs -f etcd-operator -c etcd-restore-operator
time="2018-01-12T18:11:51Z" level=info msg="Go Version: go1.9.2"
time="2018-01-12T18:11:51Z" level=info msg="Go OS/Arch: linux/amd64"
time="2018-01-12T18:11:51Z" level=info msg="etcd-restore-operator Version: 0.8.1+git"
time="2018-01-12T18:11:51Z" level=info msg="Git SHA: 6a4b7da"

INFO[0010] e2e setup successfully
--- FAIL: TestBackupAndRestore (122.60s)
	crd_util.go:44: creating etcd cluster: test-etcd-backup-restore-1209716615978421639
	util.go:52: 2018-01-12 10:11:57.900884118 -0800 PST m=+11.655376108 waiting size (3), healthy etcd members: names ([])
	util.go:52: 2018-01-12 10:12:07.904904149 -0800 PST m=+21.659837139 waiting size (3), healthy etcd members: names ([])
	util.go:52: 2018-01-12 10:12:17.903870559 -0800 PST m=+31.659243549 waiting size (3), healthy etcd members: names ([])
	util.go:52: 2018-01-12 10:12:27.905310429 -0800 PST m=+41.661124419 waiting size (3), healthy etcd members: names ([test-etcd-backup-restore-1209716615978421639-0000 test-etcd-backup-restore-1209716615978421639-0001])
	util.go:52: 2018-01-12 10:12:37.904470489 -0800 PST m=+51.660725479 waiting size (3), healthy etcd members: names ([test-etcd-backup-restore-1209716615978421639-0000 test-etcd-backup-restore-1209716615978421639-0001])
	util.go:52: 2018-01-12 10:12:47.902166579 -0800 PST m=+61.658789569 waiting size (3), healthy etcd members: names ([test-etcd-backup-restore-1209716615978421639-0000 test-etcd-backup-restore-1209716615978421639-0001 test-etcd-backup-restore-1209716615978421639-0002])
	backup_restore_test.go:172: backup for cluster (test-etcd-backup-restore-1209716615978421639) has been saved
	util.go:52: 2018-01-12 10:12:58.942750659 -0800 PST m=+72.699887649 waiting size (3), healthy etcd members: names ([])
	util.go:52: 2018-01-12 10:13:08.945232582 -0800 PST m=+82.702810572 waiting size (3), healthy etcd members: names ([])
	util.go:52: 2018-01-12 10:13:18.94420554 -0800 PST m=+92.702224530 waiting size (3), healthy etcd members: names ([test-etcd-backup-restore-1209716615978421639-0000])
	util.go:52: 2018-01-12 10:13:28.947295488 -0800 PST m=+102.705755478 waiting size (3), healthy etcd members: names ([test-etcd-backup-restore-1209716615978421639-0000])
	util.go:52: 2018-01-12 10:13:38.941260132 -0800 PST m=+112.700088122 waiting size (3), healthy etcd members: names ([test-etcd-backup-restore-1209716615978421639-0000])
	util.go:52: 2018-01-12 10:13:48.943556123 -0800 PST m=+122.702825113 waiting size (3), healthy etcd members: names ([test-etcd-backup-restore-1209716615978421639-0000])
	util.go:52: 2018-01-12 10:13:58.945036381 -0800 PST m=+132.704746371 waiting size (3), healthy etcd members: names ([test-etcd-backup-restore-1209716615978421639-0000])
	backup_restore_test.go:220: failed to see restored etcd cluster(test-etcd-backup-restore-1209716615978421639) reach 3 members: still failing after 6 retries

hongchaodeng · 2018-01-12T18:38:14Z

This PR makes it more stable, but the flake test still encounters failures: https://jenkins-etcd.prod.coreos.systems/job/etcd-operator-flaketest/buildTimeTrend

So there are other bugs we need to dig into.

fanminshi · 2018-01-12T18:57:00Z

pkg/util/k8sutil/k8sutil.go


+func NewEtcdPod(m *etcdutil.Member, initialCluster []string, clusterName, state, token string, cs api.ClusterSpec, owner metav1.OwnerReference) *v1.Pod {
+	pod := newEtcdPod(m, initialCluster, clusterName, state, token, cs)
+	applyPodPolicy(clusterName, pod, cs.Pod)
 	addOwnerRefToObject(pod.GetObjectMeta(), owner)


I think addOwnerRefToObject can stay in the newEtcdPod function. I don't think the order of addOwnerRefToObject matters.

The order of addOwnerRefToObject() doesn't matter, but it's good to keep the owner param out of newEtcdPod().

fair enough.

fanminshi · 2018-01-12T19:00:39Z

pkg/util/k8sutil/k8sutil.go

 	if backupURL != nil {
 		addRecoveryToPod(pod, token, m, cs, backupURL)
 	}
+	applyPodPolicy(clusterName, pod, cs.Pod)
+	addOwnerRefToObject(pod.GetObjectMeta(), owner)


It seems to me the order of calling applyPodPolicy is important because this code reasons about init containers.

I don't think the order of addOwnerRefToObject matters in this case. Correct me if I am wrong.
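
To illustrate the ordering point, here is a minimal sketch under the assumption that the pod policy is stamped onto every container present at call time, init containers included. The types, the `CPULimit` field, and the helper names are hypothetical and do not mirror the etcd-operator code.

```go
package main

import "fmt"

// Container, Pod and PodPolicy are hypothetical, simplified types.
type Container struct {
	Name     string
	CPULimit string
}

type Pod struct {
	Containers     []Container
	InitContainers []Container
}

type PodPolicy struct{ CPULimit string }

// applyPolicy stamps the policy onto every container that exists
// on the pod at the moment it is called.
func applyPolicy(p *Pod, policy PodPolicy) {
	for i := range p.Containers {
		p.Containers[i].CPULimit = policy.CPULimit
	}
	for i := range p.InitContainers {
		p.InitContainers[i].CPULimit = policy.CPULimit
	}
}

// addRecovery appends a recovery init container.
func addRecovery(p *Pod) {
	p.InitContainers = append(p.InitContainers, Container{Name: "recovery"})
}

// buildPod builds a pod with either ordering of the two steps.
func buildPod(policyFirst bool) *Pod {
	p := &Pod{Containers: []Container{{Name: "etcd"}}}
	policy := PodPolicy{CPULimit: "500m"}
	if policyFirst {
		applyPolicy(p, policy)
		addRecovery(p)
	} else {
		addRecovery(p)
		applyPolicy(p, policy)
	}
	return p
}

func main() {
	fmt.Printf("policy first: %q\n", buildPod(true).InitContainers[0].CPULimit)  // ""
	fmt.Printf("policy last:  %q\n", buildPod(false).InitContainers[0].CPULimit) // "500m"
}
```

When the policy runs first, the recovery container is added too late to receive it; running the policy last covers every container on the pod.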

hasbro17 · 2018-01-12T19:13:53Z

LGTM

fanminshi · 2018-01-12T19:14:28Z

lgtm

hongchaodeng force-pushed the fix_rec branch from 7c781be to b6eaa3e Compare January 11, 2018 20:42

hongchaodeng changed the title ~~k8sutil: add recovery container instead of overriding~~ [wip] k8sutil: add recovery container instead of overriding Jan 11, 2018

hongchaodeng force-pushed the fix_rec branch from b6eaa3e to 6a4b7da Compare January 11, 2018 22:52

hongchaodeng changed the title ~~[wip] k8sutil: add recovery container instead of overriding~~ k8sutil: add recovery container instead of overriding Jan 12, 2018

fanminshi reviewed Jan 12, 2018

Commit 1052342: k8sutil: add recovery container instead of overriding

Also fix the ordering to apply pod policy on recovery containers too

hongchaodeng force-pushed the fix_rec branch from 6a4b7da to 1052342 Compare January 12, 2018 18:24

fanminshi reviewed Jan 12, 2018

hongchaodeng merged commit f7b24d7 into coreos:master Jan 12, 2018

hongchaodeng deleted the fix_rec branch January 12, 2018 19:15
