This repository has been archived by the owner on Mar 28, 2020. It is now read-only.

k8sutil: add recovery container instead of overriding #1853

Merged 1 commit on Jan 12, 2018


hongchaodeng
Member

@hongchaodeng hongchaodeng commented Jan 11, 2018

Also fixes the ordering so that the pod policy is applied to the recovery containers too.

ref: #1825 (comment)

@hongchaodeng hongchaodeng changed the title k8sutil: add recovery container instead of overriding [wip] k8sutil: add recovery container instead of overriding Jan 11, 2018
@hongchaodeng
Member Author

Stress testing proves that this will fix the bug: https://jenkins-etcd.prod.coreos.systems/job/etcd-operator-flaketest/buildTimeTrend

@hongchaodeng hongchaodeng changed the title [wip] k8sutil: add recovery container instead of overriding k8sutil: add recovery container instead of overriding Jan 12, 2018
@hongchaodeng
Member Author

@etcd-bot retest this please

@@ -221,14 +222,17 @@ func addOwnerRefToObject(o metav1.Object, r metav1.OwnerReference) {
 // It's special that it has new token, and might need recovery init containers
 func NewSeedMemberPod(clusterName string, ms etcdutil.MemberSet, m *etcdutil.Member, cs api.ClusterSpec, owner metav1.OwnerReference, backupURL *url.URL) *v1.Pod {
 	token := uuid.New()
-	pod := NewEtcdPod(m, ms.PeerURLPairs(), clusterName, "new", token, cs, owner)
+	pod := newEtcdPod(m, ms.PeerURLPairs(), clusterName, "new", token, cs)
Contributor

Why change NewEtcdPod to newEtcdPod?

Contributor

Never mind, I see there is an order change.

@fanminshi
Contributor

Tested this PR. I still see the same issue.

$ kubectl -n e2e-8 logs -f etcd-operator -c etcd-restore-operator
time="2018-01-12T18:11:51Z" level=info msg="Go Version: go1.9.2"
time="2018-01-12T18:11:51Z" level=info msg="Go OS/Arch: linux/amd64"
time="2018-01-12T18:11:51Z" level=info msg="etcd-restore-operator Version: 0.8.1+git"
time="2018-01-12T18:11:51Z" level=info msg="Git SHA: 6a4b7da"
INFO[0010] e2e setup successfully
--- FAIL: TestBackupAndRestore (122.60s)
	crd_util.go:44: creating etcd cluster: test-etcd-backup-restore-1209716615978421639
	util.go:52: 2018-01-12 10:11:57.900884118 -0800 PST m=+11.655376108 waiting size (3), healthy etcd members: names ([])
	util.go:52: 2018-01-12 10:12:07.904904149 -0800 PST m=+21.659837139 waiting size (3), healthy etcd members: names ([])
	util.go:52: 2018-01-12 10:12:17.903870559 -0800 PST m=+31.659243549 waiting size (3), healthy etcd members: names ([])
	util.go:52: 2018-01-12 10:12:27.905310429 -0800 PST m=+41.661124419 waiting size (3), healthy etcd members: names ([test-etcd-backup-restore-1209716615978421639-0000 test-etcd-backup-restore-1209716615978421639-0001])
	util.go:52: 2018-01-12 10:12:37.904470489 -0800 PST m=+51.660725479 waiting size (3), healthy etcd members: names ([test-etcd-backup-restore-1209716615978421639-0000 test-etcd-backup-restore-1209716615978421639-0001])
	util.go:52: 2018-01-12 10:12:47.902166579 -0800 PST m=+61.658789569 waiting size (3), healthy etcd members: names ([test-etcd-backup-restore-1209716615978421639-0000 test-etcd-backup-restore-1209716615978421639-0001 test-etcd-backup-restore-1209716615978421639-0002])
	backup_restore_test.go:172: backup for cluster (test-etcd-backup-restore-1209716615978421639) has been saved
	util.go:52: 2018-01-12 10:12:58.942750659 -0800 PST m=+72.699887649 waiting size (3), healthy etcd members: names ([])
	util.go:52: 2018-01-12 10:13:08.945232582 -0800 PST m=+82.702810572 waiting size (3), healthy etcd members: names ([])
	util.go:52: 2018-01-12 10:13:18.94420554 -0800 PST m=+92.702224530 waiting size (3), healthy etcd members: names ([test-etcd-backup-restore-1209716615978421639-0000])
	util.go:52: 2018-01-12 10:13:28.947295488 -0800 PST m=+102.705755478 waiting size (3), healthy etcd members: names ([test-etcd-backup-restore-1209716615978421639-0000])
	util.go:52: 2018-01-12 10:13:38.941260132 -0800 PST m=+112.700088122 waiting size (3), healthy etcd members: names ([test-etcd-backup-restore-1209716615978421639-0000])
	util.go:52: 2018-01-12 10:13:48.943556123 -0800 PST m=+122.702825113 waiting size (3), healthy etcd members: names ([test-etcd-backup-restore-1209716615978421639-0000])
	util.go:52: 2018-01-12 10:13:58.945036381 -0800 PST m=+132.704746371 waiting size (3), healthy etcd members: names ([test-etcd-backup-restore-1209716615978421639-0000])
	backup_restore_test.go:220: failed to see restored etcd cluster(test-etcd-backup-restore-1209716615978421639) reach 3 members: still failing after 6 retries

@hongchaodeng
Member Author

This PR makes it more stable, but the flake test still encountered failures: https://jenkins-etcd.prod.coreos.systems/job/etcd-operator-flaketest/buildTimeTrend

So there are other bugs we need to dig into.


func NewEtcdPod(m *etcdutil.Member, initialCluster []string, clusterName, state, token string, cs api.ClusterSpec, owner metav1.OwnerReference) *v1.Pod {
	pod := newEtcdPod(m, initialCluster, clusterName, state, token, cs)
	applyPodPolicy(clusterName, pod, cs.Pod)
	addOwnerRefToObject(pod.GetObjectMeta(), owner)
Contributor

I think addOwnerRefToObject can stay in the newEtcdPod function. I don't think the order of addOwnerRefToObject matters.

Member Author

The ordering of addOwnerRefToObject() doesn't matter, but it's good to separate the owner param out of newEtcdPod().

Contributor

Fair enough.

	if backupURL != nil {
		addRecoveryToPod(pod, token, m, cs, backupURL)
	}
	applyPodPolicy(clusterName, pod, cs.Pod)
	addOwnerRefToObject(pod.GetObjectMeta(), owner)
Contributor

It seems to me the order of calling applyPodPolicy is important because this code reasons about init containers.

I don't think the order of addOwnerRefToObject matters in this case. Correct me if I am wrong.

@hasbro17
Contributor

LGTM

@fanminshi
Contributor

lgtm

@hongchaodeng hongchaodeng merged commit f7b24d7 into coreos:master Jan 12, 2018
@hongchaodeng hongchaodeng deleted the fix_rec branch January 12, 2018 19:15