Add functional test coverage for Agent's loss of etcd connectivity #715

bcwaldon · 2014-07-27T16:06:38Z

Related #708

We need to add a functional test that ensures that the fleet agent behaves correctly when it loses access to the etcd cluster.

wuqixuan · 2015-09-15T07:37:31Z

+1

themicster · 2016-01-26T19:01:08Z

+1. It turns out that to simulate etcd being down because its machine is dead, you have to drop packets to and from the IP address that etcd should be running on. When the machine is up but nothing is listening on that port, you receive an ICMP message telling you that the connection is refused. The client then immediately tries the next etcd server, which is great and causes no problem.
But when the machine is dead, no ICMP packet is sent back, and the client normally waits out a long timeout before trying the next etcd server. That is where I figured HeaderTimeoutPerRequest would come in handy. I agree that a test should be created to simulate a non-existent machine in the etcd list.
So, to do that properly, I have found that you cannot simply stop etcd on a server. You have to drop the packets to and from that server to simulate the whole machine being offline. Alternatively, you can put an invalid IP address in the list of etcd machines; I would think this would be much more difficult, but maybe not. Also keep in mind that there is a new pinning system that pins good servers, so you will need to make sure the client is not just skipping the server altogether. I believe it retries all the servers every 10 seconds or so and then decides which one to pin.

I'm not sure how your test environment is set up, but in my case I have a Cumulus switch where I can add an iptables rule to block traffic to and from a server. I think you could also simulate this without losing all connectivity to the machine by dropping all ICMP, or by finding the specific ICMP type that gets sent for "connection refused" and blocking that. Be sure you understand the actual packet flow here; it's been a while since I've read TCP/IP Illustrated, so I could be missing something. But I hope this helps.

antrik · 2016-02-26T15:27:55Z

So I think this needs some clarification on what exactly we want to test here.

First of all, AIUI this is about total loss of connectivity to the entire etcd cluster -- failover handling is an entirely different issue, right? As the functional tests currently all work with a single etcd instance, this should be more or less straightforward I guess...

Next question is about what exactly "loss of connectivity" means. Should we test the clear-cut case when we get an immediate error response, e.g. when the etcd process went down? Or the case where we don't get replies (but eventually a timeout I guess) when the server process or the network just hangs? Or both?

Also, what exactly do we want to test when connectivity is lost? Just that the state of existing units doesn't change on the disconnected node, or also how the rest of the cluster behaves? And if it's the latter, what do we actually expect there?... What about reconciliation once connectivity is re-established?

themicster · 2016-03-01T04:58:04Z

So after reviewing #708 again, I think it can be made even simpler for a phase one set of tests. Some tests need to be created that simulate flapping etcd in two ways: the etcd service on a machine starting and stopping, and the machine etcd is running on losing connectivity. But because of the pinning behavior, the test may fail sometimes and succeed other times with no change to infrastructure.

jonboulle · 2016-03-03T14:15:51Z

> First of all, AIUI this is about total loss of connectivity to the entire etcd cluster -- failover handling is an entirely different issue, right?

Yes

> Next question is about what exactly "loss of connectivity" means. Should we test the clear-cut case when we get an immediate error response, e.g. when the etcd process went down?

Yes

> Or the case where we don't get replies (but eventually a timeout I guess) when the server process or the network just hangs?

This would be nice to have but isn't an immediate requirement. Let's start with loss and go from there.

> Also, what exactly do we want to test when connectivity is lost? Just that the state of existing units doesn't change on the disconnected node, or also how the rest of the cluster behaves?

Only the disconnected node.

bcwaldon mentioned this issue Jul 27, 2014

Loss of Agent behavior incorrect #708

Closed

bcwaldon added the refactor label Jul 30, 2014

bcwaldon added the help wanted label Dec 18, 2014

jonboulle added this to the v0.12.0 milestone Jan 19, 2016

jonboulle mentioned this issue Jan 26, 2016

Need way to set HeaderTimeoutPerRequest #1397

Closed

kayrus mentioned this issue Feb 26, 2016

panic due to double-close of channel #1067

Closed

antrik mentioned this issue Mar 11, 2016

Add test for behaviour on etcd connectivity loss #1501

Merged

jonboulle closed this as completed in #1501 Apr 1, 2016
