TestRejectUnhealthyRemove: should reject quorum breaking remove #7609

heyitsanthony · 2017-03-27T16:36:43Z

--- FAIL: TestRejectUnhealthyRemove (23.22s)
	cluster_test.go:418: should reject quorum breaking remove

Seen a few times recently on semaphore.

edits from gyuho

logs

2017-04-06 01:59:26.616618 E | etcdserver/api/v2http: error removing member 2cb0e92d6ad0b3c7 (etcdserver: request timed out)
2017-04-06 01:59:26.617161 E | etcdserver/api/v2http: got unexpected response error (etcdserver: request timed out)
2017-04-06 01:59:26.635774 I | raft: propose conf Type:EntryConfChange Data:"\010\211\210\212\344\217\350\226\202\373\001\020\001\030\307\347\302\326\326\245\272\330,"  ignored since pending unapplied configuration
2017-04-06 01:59:27.202906 W | etcdserver: failed to send out heartbeat on time (exceeded the 10ms timeout for 261.905µs)
2017-04-06 01:59:27.203317 W | etcdserver: server is likely overloaded

https://jenkins-etcd-public.prod.coreos.systems/job/etcd-proxy/1212/consoleFull

The text was updated successfully, but these errors were encountered:

gyuho · 2017-03-27T23:00:36Z

@heyitsanthony

Based on that it took 20+ seconds, reproducible with this diff

diff --git a/etcdserver/server.go b/etcdserver/server.go
index 70e14924..a62ae79a 100644
--- a/etcdserver/server.go
+++ b/etcdserver/server.go
@@ -1110,7 +1110,7 @@ func (s *EtcdServer) mayRemoveMember(id types.ID) error {

        // protect quorum if some members are down
        m := s.cluster.Members()
-       active := numConnectedSince(s.r.transport, time.Now().Add(-HealthInterval), s.ID(), m)
+       active := numConnectedSince(s.r.transport, time.Now().Add(-time.Nanosecond), s.ID(), m)
        if (active - 1) < 1+((len(m)-1)/2) {
                plog.Warningf("reconfigure breaks active quorum, rejecting remove member %s", id)
                return ErrUnhealthy

Which means, slow machine semaphore took more thanHealthInterval (5-second) before it starts the counting, now the active number of connection is 3 (which fails the boolean case (active - 1) < 1+((len(m)-1)/2).

Should we just time.Sleep(HealthInterval) before calling removeMember?

heyitsanthony · 2017-03-27T23:28:43Z

Which means, slow machine semaphore took more thanHealthInterval (5-second) before it starts the counting, now the active number of connection is 3 (which passes the boolean case (active - 1) < 1+((len(m)-1)/2).

Except Launch won't return until the cluster and all members are ready so it should already be counting. What could be happening is the member hasn't detected that the stopped members have disconnected (e.g., run launch, slow cluster causes HealthInterval to elapse, all members look healthy), so it appears there are still enough nodes for quorum.

Would like to avoid unconditional sleeps if possible. I was hoping number of connected peers would be available from /metrics and the test could poll on that, but it's not there. Might be worth adding... /cc @xiang90

Fix etcd-io#7609. Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>

gyuho · 2017-04-19T21:26:29Z

@heyitsanthony @xiang90

Think this is fixed

Here we re-start members[0], https://github.com/coreos/etcd/blob/master/integration/cluster_test.go#L434

2017-04-06 01:59:03.411056 I | integration: launching 7781720184774009405 ()
2017-04-06 01:59:03.431687 I | etcdserver: name = 7781720184774009405
2017-04-06 01:59:03.431839 I | etcdserver: data dir = /tmp/etcd334647200
2017-04-06 01:59:03.431965 I | etcdserver: member dir = /tmp/etcd334647200/member
2017-04-06 01:59:03.432083 I | etcdserver: heartbeat = 10ms
2017-04-06 01:59:03.432183 I | etcdserver: election = 100ms
2017-04-06 01:59:03.432231 I | etcdserver: snapshot count = 0
2017-04-06 01:59:03.432338 I | etcdserver: advertise client URLs = unix://127.0.0.1:2268028587
2017-04-06 01:59:03.434739 I | etcdserver: restarting member 2cb0e92d6ad0b3c7 in cluster 7c2e5ed25e8b5f08 at commit index 12
2017-04-06 01:59:03.434966 I | raft: 2cb0e92d6ad0b3c7 became follower at term 2
2017-04-06 01:59:03.435056 I | raft: newRaft 2cb0e92d6ad0b3c7 [peers: [], term: 2, commit: 12, applied: 0, lastindex: 12, lastterm: 2]
2017-04-06 01:59:03.445238 W | auth: simple token is not cryptographically signed
2017-04-06 01:59:03.450042 I | etcdserver: set snapshot count to default 100000
2017-04-06 01:59:03.450179 I | etcdserver: starting server... [version: 3.2.0+git, cluster version: to_be_decided]
2017-04-06 01:59:03.455319 I | integration: launched 7781720184774009405 ()
2017-04-06 01:59:03.455438 I | integration: restarted 7781720184774009405 ()

// bring cluster to (4,1)
c.Members[0].Restart(t)

And I found out the following logs show the same pattern

2017-04-06 01:59:03.464103 I | raft: 2cb0e92d6ad0b3c7 is starting a new election at term 2
2017-04-06 01:59:03.464503 I | raft: 2cb0e92d6ad0b3c7 became candidate at term 3
2017-04-06 01:59:03.464717 I | raft: 2cb0e92d6ad0b3c7 received MsgVoteResp from 2cb0e92d6ad0b3c7 at term 3
2017-04-06 01:59:03.464902 I | raft: 2cb0e92d6ad0b3c7 became leader at term 3

2cb0e92d6ad0b3c7 restarted, did not wait for config change apply, and becomes the leader in the single-node cluster. Basically same issue as #7595 and fixed by #7706.

Logs can be found at https://jenkins-etcd-public.prod.coreos.systems/job/etcd-proxy/1212/consoleFull.

heyitsanthony · 2017-04-19T21:44:32Z

OK can reopen if it turns out that timing problem turns out to be causing it to break as well.

heyitsanthony added the area/testing label Mar 27, 2017

heyitsanthony assigned gyuho Mar 28, 2017

This was referenced Apr 3, 2017

Election RPC service #7634

Merged

test: generate coverage report even if some tests fail #7661

Merged

gyuho added a commit to gyuho/etcd that referenced this issue Apr 6, 2017

integration: wait 'TestRejectUnhealthyRemove' until disconnection

c194ea1

Fix etcd-io#7609. Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>

gyuho mentioned this issue Apr 6, 2017

rafthttp: add peer stream metrics #7682

Closed

heyitsanthony closed this as completed Apr 19, 2017

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TestRejectUnhealthyRemove: should reject quorum breaking remove #7609

TestRejectUnhealthyRemove: should reject quorum breaking remove #7609

heyitsanthony commented Mar 27, 2017 •

edited by gyuho

Loading

gyuho commented Mar 27, 2017 •

edited

Loading

heyitsanthony commented Mar 27, 2017

gyuho commented Apr 19, 2017 •

edited

Loading

heyitsanthony commented Apr 19, 2017

TestRejectUnhealthyRemove: should reject quorum breaking remove #7609

TestRejectUnhealthyRemove: should reject quorum breaking remove #7609

Comments

heyitsanthony commented Mar 27, 2017 • edited by gyuho Loading

gyuho commented Mar 27, 2017 • edited Loading

heyitsanthony commented Mar 27, 2017

gyuho commented Apr 19, 2017 • edited Loading

heyitsanthony commented Apr 19, 2017

heyitsanthony commented Mar 27, 2017 •

edited by gyuho

Loading

gyuho commented Mar 27, 2017 •

edited

Loading

gyuho commented Apr 19, 2017 •

edited

Loading