This repository has been archived by the owner on Mar 28, 2020. It is now read-only.

selfHosted: fix backup unable to talk to etcd pods #1108

Merged
1 commit merged into coreos:master from the f branch on May 19, 2017

Conversation

@hongchaodeng (Member) commented May 18, 2017

Currently the backup sidecar is not able to talk to etcd pods because it uses the FQDN, which isn't supported in self-hosted clusters yet. For the self-hosted case, we are going to use the PodIP instead.

fix #1107
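
In short, the sidecar should build its etcd client URL from the pod IP rather than the per-pod DNS name when the cluster is self hosted. A minimal sketch of that selection logic in Go (the helper name and exact FQDN format below are illustrative, not the operator's actual code):

package backup

import "fmt"

// clientURLForPod returns the etcd client URL the backup sidecar dials.
// Illustrative sketch only: for a self-hosted cluster, dial the pod IP
// directly, because the per-pod FQDN is not resolvable in that setup yet.
func clientURLForPod(podName, podIP, clusterName, namespace string, selfHosted bool) string {
	if selfHosted {
		return fmt.Sprintf("http://%s:2379", podIP)
	}
	// Regular clusters keep using the stable per-pod DNS name.
	return fmt.Sprintf("http://%s.%s.%s.svc:2379", podName, clusterName, namespace)
}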

@xiang90 (Collaborator) commented May 18, 2017

Can we add a test for self hosted etcd + backup?

@hongchaodeng (Member, Author) commented May 18, 2017

Right. I am going to add one once we confirm the solution is right.
This is an early submission of a potential solution.

@xiang90 (Collaborator) commented May 18, 2017

@rimusz it would be great if you can give this patch a try.

@rimusz commented May 18, 2017

I can try it.
Is there an easy way to build the etcd-operator binary?

@hongchaodeng (Member, Author)

@rimusz commented May 18, 2017

[screenshot attached: 2017-05-18 19:34:44]

@hongchaodeng the fix is working.

@rimusz commented May 18, 2017

OK, it only works on the Vagrant cluster. I tested again on a StackPointCloud cluster and still hit the same gRPC issue:

kube-etcd-backup-sidecar-1736258304-scj2c backup time="2017-05-18T19:00:49Z" level=warning msg="failed to create etcd client for pod (kube-etcd-0000): grpc: timed out when dialing"
kube-etcd-backup-sidecar-1736258304-scj2c backup time="2017-05-18T19:00:54Z" level=warning msg="failed to create etcd client for pod (kube-etcd-0001): grpc: timed out when dialing"
kube-etcd-backup-sidecar-1736258304-scj2c backup time="2017-05-18T19:00:59Z" level=warning msg="failed to create etcd client for pod (kube-etcd-0002): grpc: timed out when dialing"
kube-etcd-backup-sidecar-1736258304-scj2c backup time="2017-05-18T19:00:59Z" level=warning msg="no reachable member"
kube-etcd-backup-sidecar-1736258304-scj2c backup time="2017-05-18T19:00:59Z" level=error msg="failed to save snapshot: no reachable member"

@rimusz commented May 18, 2017

Could it be related to the operator upgrade path 0.2.4 -> 0.2.6 -> 0.3.1 (with the fix)?

@hongchaodeng (Member, Author)

This potential fix as shown here isn't backwards compatible.

@hongchaodeng changed the title from "selfHosted: fix service unreachability" to "selfHosted: fix backup unable to talk to etcd pods" on May 18, 2017
@hongchaodeng (Member, Author)

Changed the method to be upgrade-compatible.
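
(One way to make the change upgrade-compatible, sketched below purely for illustration and not necessarily what this PR does, is to keep trying the old FQDN endpoint first and fall back to the pod IP only when that endpoint cannot be dialed:)

package backup

import (
	"time"

	"github.com/coreos/etcd/clientv3"
)

// reachableURL prefers the existing FQDN endpoint so clusters created by an
// older operator keep working, and falls back to the pod IP only when the
// DNS name cannot be dialed (the self-hosted case). Illustrative only.
func reachableURL(fqdnURL, podIPURL string, dialTimeout time.Duration) string {
	c, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{fqdnURL},
		DialTimeout: dialTimeout,
	})
	if err != nil {
		// e.g. "grpc: timed out when dialing" on self-hosted clusters
		return podIPURL
	}
	c.Close()
	return fqdnURL
}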

@hongchaodeng (Member, Author)

@rimusz
Can you try again with the latest fix?

@rimusz commented May 18, 2017

sure, will give it a go tomorrow morning

@hongchaodeng (Member, Author)

@xiang90
Test added:

t.Run("create self hosted cluster from scratch", testCreateSelfHostedCluster)
t.Run("migrate boot member to self hosted cluster", testCreateSelfHostedClusterWithBootMember)
t.Run("backup for self hosted cluster", func(t *testing.T) {

Collaborator:

make this a separate test?

Member Author:

We need to clean up after all the self-hosted tests.

Collaborator:

Any reason we cannot separate it out now? I do not want to make it worse.

Member Author:

We have a cleanup function at the end of the outer layer test.

Collaborator:

just duplicate it for now.

Member Author:

The cleanup function is not simple.
It sets up pods on each node, runs commands, and waits for them to complete. It's not worth doing that again just for code structure.
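
(For reference, the shape being discussed looks roughly like the sketch below: the expensive cleanup is deferred once in the outer self-hosted test and the backup check runs as another subtest. The outer test name and the cleanup helper are hypothetical; only the t.Run lines come from the diff above.)

package e2e

import "testing"

func TestSelfHosted(t *testing.T) {
	// Hypothetical shared cleanup: sets up pods on each node, runs commands,
	// and waits for them to complete, so it runs only once for all subtests.
	defer cleanupSelfHostedHosts(t)

	// testCreateSelfHostedCluster and testCreateSelfHostedClusterWithBootMember
	// are the existing e2e helpers referenced in the diff above.
	t.Run("create self hosted cluster from scratch", testCreateSelfHostedCluster)
	t.Run("migrate boot member to self hosted cluster", testCreateSelfHostedClusterWithBootMember)
	t.Run("backup for self hosted cluster", func(t *testing.T) {
		// verifies the sidecar can reach etcd pods (via PodIP) and save a snapshot
	})
}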

@rimusz commented May 19, 2017

OK, I tested it with different backup/restore scenarios for an in-cluster etcd cluster, and everything works just fine.
But for self-hosted etcd (backing the k8s API) via bootkube, backups work now, but if I delete 2 etcd pods the cluster never recovers, so the restore part is still buggy.

I also noticed that if I update the cluster manifest with the backup options, the operator deletes all etcd pods.
If the operator gets updated with the backup features first, and then the cluster, all is fine and backups start working.

Some etcd-operator logs:

$ docker logs -f fabce0bcc1e8
time="2017-05-19T14:05:25Z" level=info msg="etcd-operator Version: 0.3.0+git"
time="2017-05-19T14:05:25Z" level=info msg="Git SHA: 369d01c"
time="2017-05-19T14:05:25Z" level=info msg="Go Version: go1.8.1"
time="2017-05-19T14:05:25Z" level=info msg="Go OS/Arch: linux/amd64"
E0519 14:05:32.686095       1 election.go:259] Failed to update lock: etcdserver: request timed out
E0519 14:05:44.562267       1 election.go:259] Failed to update lock: etcdserver: request timed out

Logs from the last remaining etcd pod:

2017-05-19 14:30:01.080147 W | etcdserver: timed out waiting for read index response
2017-05-19 14:30:01.556119 I | raft: 16f4bc0651f13a46 is starting a new election at term 1070
2017-05-19 14:30:01.556253 I | raft: 16f4bc0651f13a46 became candidate at term 1071
2017-05-19 14:30:01.556280 I | raft: 16f4bc0651f13a46 received MsgVoteResp from 16f4bc0651f13a46 at term 1071
2017-05-19 14:30:01.556305 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to a724aeed47064c66 at term 1071
2017-05-19 14:30:01.556326 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to 7cacbdfa5be01295 at term 1071
2017-05-19 14:30:03.156039 I | raft: 16f4bc0651f13a46 is starting a new election at term 1071
2017-05-19 14:30:03.156073 I | raft: 16f4bc0651f13a46 became candidate at term 1072
2017-05-19 14:30:03.156084 I | raft: 16f4bc0651f13a46 received MsgVoteResp from 16f4bc0651f13a46 at term 1072
2017-05-19 14:30:03.156093 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to a724aeed47064c66 at term 1072
2017-05-19 14:30:03.156102 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to 7cacbdfa5be01295 at term 1072
2017-05-19 14:30:05.056077 I | raft: 16f4bc0651f13a46 is starting a new election at term 1072
2017-05-19 14:30:05.056119 I | raft: 16f4bc0651f13a46 became candidate at term 1073
2017-05-19 14:30:05.056132 I | raft: 16f4bc0651f13a46 received MsgVoteResp from 16f4bc0651f13a46 at term 1073
2017-05-19 14:30:05.056142 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to 7cacbdfa5be01295 at term 1073
2017-05-19 14:30:05.056174 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to a724aeed47064c66 at term 1073
2017-05-19 14:30:05.334402 W | rafthttp: health check for peer a724aeed47064c66 could not connect: dial tcp 172.23.5.59:2380: getsockopt: connection refused
2017-05-19 14:30:05.861240 W | rafthttp: health check for peer 7cacbdfa5be01295 could not connect: dial tcp 172.23.5.208:2380: getsockopt: connection refused
2017-05-19 14:30:06.356034 I | raft: 16f4bc0651f13a46 is starting a new election at term 1073
2017-05-19 14:30:06.356067 I | raft: 16f4bc0651f13a46 became candidate at term 1074
2017-05-19 14:30:06.356077 I | raft: 16f4bc0651f13a46 received MsgVoteResp from 16f4bc0651f13a46 at term 1074
2017-05-19 14:30:06.356087 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to 7cacbdfa5be01295 at term 1074
2017-05-19 14:30:06.356095 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to a724aeed47064c66 at term 1074
2017-05-19 14:30:07.356045 I | raft: 16f4bc0651f13a46 is starting a new election at term 1074
2017-05-19 14:30:07.356075 I | raft: 16f4bc0651f13a46 became candidate at term 1075
2017-05-19 14:30:07.356085 I | raft: 16f4bc0651f13a46 received MsgVoteResp from 16f4bc0651f13a46 at term 1075
2017-05-19 14:30:07.356094 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to 7cacbdfa5be01295 at term 1075
2017-05-19 14:30:07.356102 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to a724aeed47064c66 at term 1075
2017-05-19 14:30:08.080374 W | etcdserver: timed out waiting for read index response
2017-05-19 14:30:08.656020 I | raft: 16f4bc0651f13a46 is starting a new election at term 1075
2017-05-19 14:30:08.656052 I | raft: 16f4bc0651f13a46 became candidate at term 1076
2017-05-19 14:30:08.656061 I | raft: 16f4bc0651f13a46 received MsgVoteResp from 16f4bc0651f13a46 at term 1076
2017-05-19 14:30:08.656070 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to 7cacbdfa5be01295 at term 1076
2017-05-19 14:30:08.656078 I | raft: 16f4bc0651f13a46 [logterm: 27, index: 15684] sent MsgVote request to a724aeed47064c66 at term 1076
2017-05-19 14:30:09.855966 I | raft: 16f4bc0651f13a46 is starting a new election at term 1076

@hongchaodeng (Member, Author) commented May 19, 2017

@rimusz
This PR fixes issue #1107; in other words, the sidecar couldn't make any backup at all before.

For bootkube, if you deleted or killed a majority of the nodes (or etcd pods), you have to follow bootkube's instructions to recover a downed cluster: https://github.com/kubernetes-incubator/bootkube/#recover-a-downed-cluster . If you are interested, please ask in the bootkube repo.

@rimusz commented May 19, 2017

@hongchaodeng I did not kill nodes, I killed etcd pods for testing.
And yes, this PR fixes issue #1107, so LGTM.

@hongchaodeng (Member, Author)

Yeah. We usually refer to nodes because we spread etcd pods across nodes.
Cool. Good to hear that it fixes your issue.

@xiang90 (Collaborator) commented May 19, 2017

lgtm

@rimusz commented May 19, 2017

cool, let's get it merged then :)

@hongchaodeng hongchaodeng merged commit e7eccd9 into coreos:master May 19, 2017
@hongchaodeng hongchaodeng deleted the f branch May 19, 2017 17:40