Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TestRejectUnhealthyRemove: should reject quorum breaking remove #7609

Closed
heyitsanthony opened this issue Mar 27, 2017 · 4 comments
Closed

TestRejectUnhealthyRemove: should reject quorum breaking remove #7609

heyitsanthony opened this issue Mar 27, 2017 · 4 comments
Assignees

Comments

@heyitsanthony
Copy link
Contributor

heyitsanthony commented Mar 27, 2017

--- FAIL: TestRejectUnhealthyRemove (23.22s)
	cluster_test.go:418: should reject quorum breaking remove

Seen a few times recently on semaphore.

edits from gyuho

logs

2017-04-06 01:59:26.616618 E | etcdserver/api/v2http: error removing member 2cb0e92d6ad0b3c7 (etcdserver: request timed out)
2017-04-06 01:59:26.617161 E | etcdserver/api/v2http: got unexpected response error (etcdserver: request timed out)
2017-04-06 01:59:26.635774 I | raft: propose conf Type:EntryConfChange Data:"\010\211\210\212\344\217\350\226\202\373\001\020\001\030\307\347\302\326\326\245\272\330,"  ignored since pending unapplied configuration
2017-04-06 01:59:27.202906 W | etcdserver: failed to send out heartbeat on time (exceeded the 10ms timeout for 261.905µs)
2017-04-06 01:59:27.203317 W | etcdserver: server is likely overloaded

https://jenkins-etcd-public.prod.coreos.systems/job/etcd-proxy/1212/consoleFull

@gyuho
Copy link
Contributor

gyuho commented Mar 27, 2017

@heyitsanthony

Based on that it took 20+ seconds, reproducible with this diff

diff --git a/etcdserver/server.go b/etcdserver/server.go
index 70e14924..a62ae79a 100644
--- a/etcdserver/server.go
+++ b/etcdserver/server.go
@@ -1110,7 +1110,7 @@ func (s *EtcdServer) mayRemoveMember(id types.ID) error {

        // protect quorum if some members are down
        m := s.cluster.Members()
-       active := numConnectedSince(s.r.transport, time.Now().Add(-HealthInterval), s.ID(), m)
+       active := numConnectedSince(s.r.transport, time.Now().Add(-time.Nanosecond), s.ID(), m)
        if (active - 1) < 1+((len(m)-1)/2) {
                plog.Warningf("reconfigure breaks active quorum, rejecting remove member %s", id)
                return ErrUnhealthy

Which means, slow machine semaphore took more thanHealthInterval (5-second) before it starts the counting, now the active number of connection is 3 (which fails the boolean case (active - 1) < 1+((len(m)-1)/2).

Should we just time.Sleep(HealthInterval) before calling removeMember?

@heyitsanthony
Copy link
Contributor Author

Which means, slow machine semaphore took more thanHealthInterval (5-second) before it starts the counting, now the active number of connection is 3 (which passes the boolean case (active - 1) < 1+((len(m)-1)/2).

Except Launch won't return until the cluster and all members are ready so it should already be counting. What could be happening is the member hasn't detected that the stopped members have disconnected (e.g., run launch, slow cluster causes HealthInterval to elapse, all members look healthy), so it appears there are still enough nodes for quorum.

Would like to avoid unconditional sleeps if possible. I was hoping number of connected peers would be available from /metrics and the test could poll on that, but it's not there. Might be worth adding... /cc @xiang90

gyuho added a commit to gyuho/etcd that referenced this issue Apr 6, 2017
Fix etcd-io#7609.

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
@gyuho
Copy link
Contributor

gyuho commented Apr 19, 2017

@heyitsanthony @xiang90

Think this is fixed

Here we re-start members[0], https://github.com/coreos/etcd/blob/master/integration/cluster_test.go#L434

2017-04-06 01:59:03.411056 I | integration: launching 7781720184774009405 ()
2017-04-06 01:59:03.431687 I | etcdserver: name = 7781720184774009405
2017-04-06 01:59:03.431839 I | etcdserver: data dir = /tmp/etcd334647200
2017-04-06 01:59:03.431965 I | etcdserver: member dir = /tmp/etcd334647200/member
2017-04-06 01:59:03.432083 I | etcdserver: heartbeat = 10ms
2017-04-06 01:59:03.432183 I | etcdserver: election = 100ms
2017-04-06 01:59:03.432231 I | etcdserver: snapshot count = 0
2017-04-06 01:59:03.432338 I | etcdserver: advertise client URLs = unix://127.0.0.1:2268028587
2017-04-06 01:59:03.434739 I | etcdserver: restarting member 2cb0e92d6ad0b3c7 in cluster 7c2e5ed25e8b5f08 at commit index 12
2017-04-06 01:59:03.434966 I | raft: 2cb0e92d6ad0b3c7 became follower at term 2
2017-04-06 01:59:03.435056 I | raft: newRaft 2cb0e92d6ad0b3c7 [peers: [], term: 2, commit: 12, applied: 0, lastindex: 12, lastterm: 2]
2017-04-06 01:59:03.445238 W | auth: simple token is not cryptographically signed
2017-04-06 01:59:03.450042 I | etcdserver: set snapshot count to default 100000
2017-04-06 01:59:03.450179 I | etcdserver: starting server... [version: 3.2.0+git, cluster version: to_be_decided]
2017-04-06 01:59:03.455319 I | integration: launched 7781720184774009405 ()
2017-04-06 01:59:03.455438 I | integration: restarted 7781720184774009405 ()
// bring cluster to (4,1)
c.Members[0].Restart(t)

And I found out the following logs show the same pattern

2017-04-06 01:59:03.464103 I | raft: 2cb0e92d6ad0b3c7 is starting a new election at term 2
2017-04-06 01:59:03.464503 I | raft: 2cb0e92d6ad0b3c7 became candidate at term 3
2017-04-06 01:59:03.464717 I | raft: 2cb0e92d6ad0b3c7 received MsgVoteResp from 2cb0e92d6ad0b3c7 at term 3
2017-04-06 01:59:03.464902 I | raft: 2cb0e92d6ad0b3c7 became leader at term 3

2cb0e92d6ad0b3c7 restarted, did not wait for config change apply, and becomes the leader in the single-node cluster. Basically same issue as #7595 and fixed by #7706.

Logs can be found at https://jenkins-etcd-public.prod.coreos.systems/job/etcd-proxy/1212/consoleFull.

@heyitsanthony
Copy link
Contributor Author

OK can reopen if it turns out that timing problem turns out to be causing it to break as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants