etcd dial timeout too fast when new member can't reach to the existing cluster #8532

Closed
fanminshi opened this issue Sep 8, 2017 · 1 comment · Fixed by #8599
@fanminshi
Member

etcd version: 3.1.8
via coreos/etcd-operator#1300

When a new member is added to an existing cluster, it must first fetch the cluster information from its peers.

existingCluster, gerr := GetClusterFromRemotePeers(getRemotePeerURLs(cl, cfg.Name), prt)
if gerr != nil {
  return nil, fmt.Errorf("cannot fetch cluster info from peer urls: %v", gerr)
}

https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L312-L315

If the peers cannot be reached in time, the new member's dial times out very quickly, after roughly 1s. This happens because the round tripper prt is created with cfg.peerDialTimeout(), which returns a small timeout.

func (c *ServerConfig) peerDialTimeout() time.Duration {
	// 1s for queue wait and system delay
	// + one RTT, which is smaller than 1/5 election timeout
	return time.Second + time.Duration(c.ElectionTicks)*time.Duration(c.TickMs)*time.Millisecond/5
}

https://github.com/coreos/etcd/blob/master/etcdserver/config.go#L183-L187

prt, err := rafthttp.NewRoundTripper(cfg.PeerTLSInfo, cfg.peerDialTimeout())

https://github.com/coreos/etcd/blob/master/etcdserver/server.go#L294

This dial timeout seems a bit short for bootstrapping a new etcd member. In the case of etcd-operator, a new member might not be able to reach the existing cluster because of a network issue. It would be best if the dial timeout were configurable, which would help with debugging the underlying network issue.

@BlueBlue-Lee
Contributor

Agreed. One second is too short when bootstrapping a new member.
