
can't create cluster over localhost:7777 tunneled connection #16

Open
glycerine opened this issue Jan 23, 2017 · 5 comments

@glycerine
Contributor

I think the "peer already known" logic needs to take the port into account as well as the host; or perhaps it just needs to treat localhost specially. I set up an ssh tunnel (using ssh -L 7777:localhost:7481 remotehost) between machines in EC2 to run some benchmarks, but I can't seem to make a cluster over the tunnel:

$  summitdb -join localhost:7777
24510:M 23 Jan 06:17:45.894 * summitdb 0.3.2
24510:N 23 Jan 06:17:45.897 * Node at :7481 [Follower] entering Follower state (Leader: "")
24510:N 23 Jan 06:17:45.898 # failed to join node at localhost:7777: peer already known
$ 

hmm... actually, upon further investigation, this error seems to be coming from the vendored raft here: https://github.com/tidwall/summitdb/blob/master/vendor/github.com/hashicorp/raft/raft.go#L1101
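
For illustration, here is a minimal sketch of the kind of address check that could trip here (my own paraphrase, not the actual vendored code). If peer identity is just the advertised address string, two nodes that both advertise the default ":7481" look identical:

// Paraphrased illustration only, not the vendored hashicorp/raft code.
package main

import "fmt"

// peerKnown mimics an address-only membership check.
func peerKnown(localAddr string, peers []string, candidate string) bool {
	if candidate == localAddr {
		return true // the candidate looks like the local node itself
	}
	for _, p := range peers {
		if p == candidate {
			return true // the candidate is already in the peer set
		}
	}
	return false
}

func main() {
	localAddr := ":7481" // the remote node's advertised address (default port)
	peers := []string{}  // no other peers yet
	joining := ":7481"   // the node joining through the tunnel advertises the same thing
	fmt.Println(peerKnown(localAddr, peers, joining)) // true -> "peer already known"
}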

I will continue to investigate. Ideas about how to approach this and workaround thoughts welcome.

@glycerine
Contributor Author

I tried giving the 2nd peer -p 7480 to start it on a different port. Better, but still no luck:

106692:N 23 Jan 06:31:52.857 # Election timeout reached, restarting election
106692:N 23 Jan 06:31:52.857 * Node at :7481 [Candidate] entering Candidate state
106692:N 23 Jan 06:31:52.858 # Failed to make RequestVote RPC to :7480: dial tcp :7480: getsockopt: connection refused
106692:N 23 Jan 06:31:54.124 # Election timeout reached, restarting election
106692:N 23 Jan 06:31:54.124 * Node at :7481 [Candidate] entering Candidate state
106692:N 23 Jan 06:31:54.125 # Failed to make RequestVote RPC to :7480: dial tcp :7480: getsockopt: connection refused
... more of the same...

The first peer seems to want to dial the :7480 peer directly over TCP, rather than re-using the existing (tunnelled) connection.
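
If I'm reading the transport right (assuming the upstream hashicorp/raft API, which the vendored copy may differ from), it separates the bind address from an advertised address, and outgoing RPCs dial the target's advertised address rather than reusing inbound connections. A rough sketch of the relevant knobs:

package main

import (
	"log"
	"net"
	"os"
	"time"

	"github.com/hashicorp/raft"
)

func main() {
	// Advertise 127.0.0.1:7481 to peers; as I understand it, this is the
	// address they will dial for RequestVote/AppendEntries RPCs. If it is
	// only reachable on one side of an ssh tunnel, the other side's dials
	// fail with "connection refused".
	advertise, err := net.ResolveTCPAddr("tcp", "127.0.0.1:7481")
	if err != nil {
		log.Fatal(err)
	}
	// Bind locally on :7481, advertise 127.0.0.1:7481.
	trans, err := raft.NewTCPTransport(":7481", advertise, 3, 10*time.Second, os.Stderr)
	if err != nil {
		log.Fatal(err)
	}
	defer trans.Close()
	log.Println("peers will be told to dial:", trans.LocalAddr())
}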

@glycerine
Contributor Author

interestingly, even removing the 2nd peer does not work, and no leader is elected by the one viable node:

127.0.0.1:7481> raftremovepeer ":7480"
(error) ERR leader not known
127.0.0.1:7481> 

1st node continues to say:

106692:N 23 Jan 06:54:41.831 # Failed to make RequestVote RPC to :7480: dial tcp :7480: getsockopt: connection refused
106692:N 23 Jan 06:54:43.429 # Election timeout reached, restarting election
106692:N 23 Jan 06:54:43.429 * Node at :7481 [Candidate] entering Candidate state
106692:N 23 Jan 06:54:43.431 # Failed to make RequestVote RPC to :7480: dial tcp :7480: getsockopt: connection refused
106692:N 23 Jan 06:54:45.124 # Election timeout reached, restarting election
106692:N 23 Jan 06:54:45.125 * Node at :7481 [Candidate] entering Candidate state
106692:N 23 Jan 06:54:45.126 # Failed to make RequestVote RPC to :7480: dial tcp :7480: getsockopt: connection refused

I would prefer that "raftremovepeer" be a bit more aggressive here, so as to restore the cluster to a functioning state. (Presumably the refusal is because membership changes go through the leader, and with :7480 still in the peer set the quorum is two votes, so the lone node can never elect itself.)

@glycerine
Contributor Author

(I do realize this is all in the underlying raft implementation, and has little to do with summitdb proper.)

@tidwall
Owner

tidwall commented Jan 24, 2017

I haven't played too much with running raft over ssh tunnels, so I'm trying to catch up. I'll have to investigate further to fully wrap my head around it.

Regarding the raft implementation, as I understand it, all the peers must be able to reach each other using the same host:port combination. Would it help to create entries in the hosts file to alias localhost?
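
Something along these lines, for example (hypothetical names and addresses, untested). Each machine maps the same names to whatever is reachable locally, whether that is the peer's real address or 127.0.0.1 with a tunnel listening there:

# /etc/hosts on the first EC2 machine
127.0.0.1   nodeA     # this machine's own summitdb
10.0.0.2    nodeB     # or 127.0.0.1, if nodeB is reached through a local tunnel

# /etc/hosts on the second EC2 machine
10.0.0.1    nodeA     # or 127.0.0.1, if nodeA is reached through a local tunnel
127.0.0.1   nodeB     # this machine's own summitdb

Both sides could then refer to the peers as nodeA:7481 and nodeB:7480 and reach the same processes.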

@glycerine
Contributor Author

glycerine commented Jan 25, 2017

I didn't set up symmetric tunnels, so it's my bad.

I'm sure it simplifies the raft code to assume full peer-to-peer connectivity, with each node acting as both client and "server".

It does end up simulating a split-brain pretty well, though. I wonder why hashicorp raft has such a difficult time recovering from it. It might be because I never got to 3 nodes, only 1 and then 1.5.
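
For the record, an untested sketch of what the symmetric setup probably looks like: tunnel each node's advertised port in both directions, so that ":7481" and ":7480" reach the right process from either machine:

# from the machine running the second node (port 7480), one ssh session
# carrying both directions:
ssh -L 7481:localhost:7481 -R 7480:localhost:7480 remotehost

# then join through the forward side of the tunnel:
summitdb -p 7480 -join localhost:7481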
