This repository has been archived by the owner on Mar 28, 2020. It is now read-only.

selfHosted: fix backup unable to talk to etcd pods #1108

Merged
1 commit merged into coreos:master from the f branch on May 19, 2017

Conversation

@hongchaodeng (Member) commented May 18, 2017

Currently the backup sidecar is not able to talk to etcd pods because it uses the FQDN, which isn't supported in self-hosted clusters yet. For the self-hosted case, we are going to use the PodIP instead.

fix #1107
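
In short, the sidecar should build its etcd client URL from the pod IP rather than the per-pod DNS name when the cluster is self hosted. A minimal sketch of that selection logic in Go (the helper name and exact FQDN format below are illustrative, not the operator's actual code):

package backup

import "fmt"

// clientURLForPod returns the etcd client URL the backup sidecar dials.
// Illustrative sketch only: for a self-hosted cluster, dial the pod IP
// directly, because the per-pod FQDN is not resolvable in that setup yet.
func clientURLForPod(podName, podIP, clusterName, namespace string, selfHosted bool) string {
	if selfHosted {
		return fmt.Sprintf("http://%s:2379", podIP)
	}
	// Regular clusters keep using the stable per-pod DNS name.
	return fmt.Sprintf("http://%s.%s.%s.svc:2379", podName, clusterName, namespace)
}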

@xiang90 (Collaborator) commented May 18, 2017

Can we add a test for self hosted etcd + backup?

@hongchaodeng (Member, Author) commented May 18, 2017

Right. I am going to add one once we confirm the solution is right.
This is an early submission of a potential solution.

@xiang90 (Collaborator) commented May 18, 2017

@rimusz it would be great if you can give this patch a try.

@rimusz commented May 18, 2017

I can try it.
Is there an easy way to build the etcd-operator binary?

@hongchaodeng (Member, Author)

@rimusz commented May 18, 2017

[screenshot attached: 2017-05-18 19:34:44]

@hongchaodeng the fix is working.

@rimusz commented May 18, 2017

OK, it only works on the Vagrant cluster. I tested again on a StackPointCloud cluster and still hit the same gRPC issue:

kube-etcd-backup-sidecar-1736258304-scj2c backup time="2017-05-18T19:00:49Z" level=warning msg="failed to create etcd client for pod (kube-etcd-0000): grpc: timed out when dialing"
kube-etcd-backup-sidecar-1736258304-scj2c backup time="2017-05-18T19:00:54Z" level=warning msg="failed to create etcd client for pod (kube-etcd-0001): grpc: timed out when dialing"
kube-etcd-backup-sidecar-1736258304-scj2c backup time="2017-05-18T19:00:59Z" level=warning msg="failed to create etcd client for pod (kube-etcd-0002): grpc: timed out when dialing"
kube-etcd-backup-sidecar-1736258304-scj2c backup time="2017-05-18T19:00:59Z" level=warning msg="no reachable member"
kube-etcd-backup-sidecar-1736258304-scj2c backup time="2017-05-18T19:00:59Z" level=error msg="failed to save snapshot: no reachable member"

@rimusz commented May 18, 2017

Could it be related to the operator upgrade path 0.2.4 -> 0.2.6 -> 0.3.1 (with the fix)?

@hongchaodeng (Member, Author)

This potential fix as shown here isn't backwards compatible.

@hongchaodeng changed the title from "selfHosted: fix service unreachability" to "selfHosted: fix backup unable to talk to etcd pods" on May 18, 2017
@hongchaodeng (Member, Author)

Changed the method to be upgrade-compatible.
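
(One way to make the change upgrade-compatible, sketched below purely for illustration and not necessarily what this PR does, is to keep trying the old FQDN endpoint first and fall back to the pod IP only when that endpoint cannot be dialed:)

package backup

import (
	"time"

	"github.com/coreos/etcd/clientv3"
)

// reachableURL prefers the existing FQDN endpoint so clusters created by an
// older operator keep working, and falls back to the pod IP only when the
// DNS name cannot be dialed (the self-hosted case). Illustrative only.
func reachableURL(fqdnURL, podIPURL string, dialTimeout time.Duration) string {
	c, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{fqdnURL},
		DialTimeout: dialTimeout,
	})
	if err != nil {
		// e.g. "grpc: timed out when dialing" on self-hosted clusters
		return podIPURL
	}
	c.Close()
	return fqdnURL
}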

@hongchaodeng (Member, Author)

@rimusz
Can you try again with the latest fix?

@rimusz commented May 18, 2017

sure, will give it a go tomorrow morning

@hongchaodeng (Member, Author)

@xiang90
Test added:

t.Run("create self hosted cluster from scratch", testCreateSelfHostedCluster)
t.Run("migrate boot member to self hosted cluster", testCreateSelfHostedClusterWithBootMember)
t.Run("backup for self hosted cluster", func(t *testing.T) {

Collaborator:

make this a separate test?

Member Author:

We need to clean up after all the self-hosted tests.

Collaborator:

Any reason we cannot separate it out now? I do not want to make it worse.

Member Author:

We have a cleanup function at the end of the outer layer test.

Collaborator:

just duplicate it for now.

Member Author:

The cleanup function is not simple.
It sets up pods on each node, runs commands, and waits for them to complete. It's not worth doing that again just for code structure.
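
(For reference, the shape being discussed looks roughly like the sketch below: the expensive cleanup is deferred once in the outer self-hosted test and the backup check runs as another subtest. The outer test name and the cleanup helper are hypothetical; only the t.Run lines come from the diff above.)

package e2e

import "testing"

func TestSelfHosted(t *testing.T) {
	// Hypothetical shared cleanup: sets up pods on each node, runs commands,
	// and waits for them to complete, so it runs only once for all subtests.
	defer cleanupSelfHostedHosts(t)

	// testCreateSelfHostedCluster and testCreateSelfHostedClusterWithBootMember
	// are the existing e2e helpers referenced in the diff above.
	t.Run("create self hosted cluster from scratch", testCreateSelfHostedCluster)
	t.Run("migrate boot member to self hosted cluster", testCreateSelfHostedClusterWithBootMember)
	t.Run("backup for self hosted cluster", func(t *testing.T) {
		// verifies the sidecar can reach etcd pods (via PodIP) and save a snapshot
	})
}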

@rimusz commented May 19, 2017

OK, I tested it with different backup/restore scenarios for an in-cluster etcd cluster, and everything works just fine.
But for self-hosted etcd (backing the k8s API) via bootkube, backups work now, but if I delete 2 etcd pods the cluster never recovers, so the restore part is still buggy.

I also noticed that if I update the cluster manifest with the backup options, the operator deletes all etcd pods.
If the operator gets updated with the backup features first, and then the cluster, all is fine and backups start working.

Some etcd-operator logs:

$ docker logs -f fabce0bcc1e8
time="2017-05-19T14:05:25Z" level=info msg="etcd-operator Version: 0.3.0+git"
time="2017-05-19T14:05:25Z" level=info msg="Git SHA: 369d01c"
time="2017-05-19T14:05:25Z" level=info msg="Go Version: go1.8.1"
time="2017-05-19T14:05:25Z" level=info msg="Go OS/Arch: linux/amd64"
E0519 14:05:32.686095       1 election.go:259] Failed to update lock: etcdserver: request timed out
E0519 14:05:44.562267       1 election.go:259] Failed to update lock: etcdserver: request timed out

Logs from the last remaining etcd pod:

2017-05-19 14:30:01.080147 W | etcdserver: timed out waiting for read index response
2017-05-19 14:30:01.556119 I | raft: 16f4bc0651f13a46 is starting a new election at term 1070
2017-05-19 14:30:01.556253 I | raft: 16f4bc0651f13a46 became candidate at term 1071
2017-05-19 14:30:01.556280 I | raft: 16f4bc0651f13a46 received MsgVoteResp from 16f4bc0651f13a46 at term 1071
2017-05-19 14:30:01.556305 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to a724aeed47064c66 at term 1071
2017-05-19 14:30:01.556326 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to 7cacbdfa5be01295 at term 1071
2017-05-19 14:30:03.156039 I | raft: 16f4bc0651f13a46 is starting a new election at term 1071
2017-05-19 14:30:03.156073 I | raft: 16f4bc0651f13a46 became candidate at term 1072
2017-05-19 14:30:03.156084 I | raft: 16f4bc0651f13a46 received MsgVoteResp from 16f4bc0651f13a46 at term 1072
2017-05-19 14:30:03.156093 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to a724aeed47064c66 at term 1072
2017-05-19 14:30:03.156102 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to 7cacbdfa5be01295 at term 1072
2017-05-19 14:30:05.056077 I | raft: 16f4bc0651f13a46 is starting a new election at term 1072
2017-05-19 14:30:05.056119 I | raft: 16f4bc0651f13a46 became candidate at term 1073
2017-05-19 14:30:05.056132 I | raft: 16f4bc0651f13a46 received MsgVoteResp from 16f4bc0651f13a46 at term 1073
2017-05-19 14:30:05.056142 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to 7cacbdfa5be01295 at term 1073
2017-05-19 14:30:05.056174 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to a724aeed47064c66 at term 1073
2017-05-19 14:30:05.334402 W | rafthttp: health check for peer a724aeed47064c66 could not connect: dial tcp 172.23.5.59:2380: getsockopt: connection refused
2017-05-19 14:30:05.861240 W | rafthttp: health check for peer 7cacbdfa5be01295 could not connect: dial tcp 172.23.5.208:2380: getsockopt: connection refused
2017-05-19 14:30:06.356034 I | raft: 16f4bc0651f13a46 is starting a new election at term 1073
2017-05-19 14:30:06.356067 I | raft: 16f4bc0651f13a46 became candidate at term 1074
2017-05-19 14:30:06.356077 I | raft: 16f4bc0651f13a46 received MsgVoteResp from 16f4bc0651f13a46 at term 1074
2017-05-19 14:30:06.356087 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to 7cacbdfa5be01295 at term 1074
2017-05-19 14:30:06.356095 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to a724aeed47064c66 at term 1074
2017-05-19 14:30:07.356045 I | raft: 16f4bc0651f13a46 is starting a new election at term 1074
2017-05-19 14:30:07.356075 I | raft: 16f4bc0651f13a46 became candidate at term 1075
2017-05-19 14:30:07.356085 I | raft: 16f4bc0651f13a46 received MsgVoteResp from 16f4bc0651f13a46 at term 1075
2017-05-19 14:30:07.356094 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to 7cacbdfa5be01295 at term 1075
2017-05-19 14:30:07.356102 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to a724aeed47064c66 at term 1075
2017-05-19 14:30:08.080374 W | etcdserver: timed out waiting for read index response
2017-05-19 14:30:08.656020 I | raft: 16f4bc0651f13a46 is starting a new election at term 1075
2017-05-19 14:30:08.656052 I | raft: 16f4bc0651f13a46 became candidate at term 1076
2017-05-19 14:30:08.656061 I | raft: 16f4bc0651f13a46 received MsgVoteResp from 16f4bc0651f13a46 at term 1076
2017-05-19 14:30:08.656070 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to 7cacbdfa5be01295 at term 1076
2017-05-19 14:30:08.656078 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to a724aeed47064c66 at term 1076
2017-05-19 14:30:09.855966 I | raft: 16f4bc0651f13a46 is starting a new election at term 1076

@hongchaodeng (Member, Author) commented May 19, 2017

@rimusz
This PR fixes issue #1107; in other words, the sidecar couldn't make any backup at all before.

For bootkube, if you deleted or killed a majority of the nodes (or etcd pods), you have to follow bootkube's instructions to recover a downed cluster: https://github.com/kubernetes-incubator/bootkube/#recover-a-downed-cluster . If you are interested, please ask in the bootkube repo.

@rimusz commented May 19, 2017

@hongchaodeng I did not kill nodes, I killed etcd pods for testing.
And yes, this PR fixes issue #1107, so LGTM.

@hongchaodeng (Member, Author)

Yeah. We usually refer to nodes because we spread etcd pods across nodes.
Cool. Good to hear that it fixes your issue.

@xiang90 (Collaborator) commented May 19, 2017

lgtm

@rimusz commented May 19, 2017

cool, let's get it merged then :)

@hongchaodeng hongchaodeng merged commit e7eccd9 into coreos:master May 19, 2017
@hongchaodeng hongchaodeng deleted the f branch May 19, 2017 17:40