This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

Add functional test coverage for Agent's loss of etcd connectivity #715

Closed
bcwaldon opened this issue Jul 27, 2014 · 5 comments

Comments

@bcwaldon
Contributor

Related #708

We need to add a functional test that ensures that the fleet agent behaves correctly when it loses access to the etcd cluster.

@wuqixuan
Contributor

+1

@themicster

+1. It turns out that to simulate etcd being down because its machine is dead, you have to drop packets to and from the IP that etcd should be running on. When the machine is up but etcd is not listening, you receive an ICMP message telling you the connection was refused because nothing is listening on that port. The client then immediately tries the next etcd server, which is great and causes no problem.
But when the machine is dead, no ICMP packet comes back, and the client waits out a normally long timeout before trying the next etcd server. That is where I figured HeaderTimeoutPerRequest would come in handy. I agree a test should be created to simulate a non-existent machine in the etcd list.
So to do that properly, I have found that you cannot simply stop etcd on a server. You have to drop the packets to and from that server to simulate the whole machine being offline. Alternatively, you could put an invalid IP address in the list of etcd machines; I would think this is much more difficult, but maybe not. Also keep in mind there is a new pinning system that pins good servers, so you will need to make sure the client is not just skipping the server altogether. I believe it retries all the servers every 10 seconds or so and then decides which one to pin.
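
To make the two failure modes concrete, here is a minimal sketch (not fleet or etcd client code) that classifies a single TCP connection attempt. The function name `probe` and the one-second default timeout are illustrative assumptions. On typical TCP stacks, a live host with no listener on the port answers right away and Python raises `ConnectionRefusedError`, while a dead or blackholed host sends nothing back and the attempt runs into the timeout.

```python
import socket


def probe(host, port, timeout=1.0):
    """Classify one TCP connection attempt to (host, port).

    Returns "open", "refused", "timeout", or "error".
    `probe` is a hypothetical helper for illustration only.
    """
    try:
        # create_connection performs the TCP handshake, bounded by `timeout`.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The host is up but nothing listens on the port: fails fast.
        return "refused"
    except socket.timeout:
        # No reply at all, as with a dead or blackholed machine.
        return "timeout"
    except OSError:
        # Anything else, e.g. no route to the network.
        return "error"
```

A test that only stops the etcd process would see `"refused"` from a helper like this; a test that drops the packets would see `"timeout"`, which is the slow path described above.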

I'm not sure how your test environment is set up, but in my case I have a Cumulus switch where I can add an iptables rule to block traffic to and from a server. I think you could also simulate this without losing all connectivity to the machine by dropping all ICMP, or by finding the specific ICMP type that is sent for "connection refused" and blocking that. Be sure you understand the actual packet flow here; it's been a while since I've read TCP/IP Illustrated, so I could be missing something. But I hope this helps.
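
As a rough sketch of the packet-dropping approach, the helper below only builds the iptables command lines for blackholing one IPv4 address on the host where the rules are installed; it does not run them. The name `blackhole_rules` and the choice of the `INPUT` and `OUTPUT` chains are assumptions for a rule set applied on the agent's own machine. On a device that forwards the traffic, such as a switch, the `FORWARD` chain would presumably be the relevant one instead.

```python
import ipaddress


def blackhole_rules(etcd_ip, action="-A"):
    """Build iptables commands that drop all traffic to and from `etcd_ip`.

    action="-A" appends the rules, action="-D" deletes them again.
    `blackhole_rules` is a hypothetical helper; nothing is executed here.
    """
    if action not in ("-A", "-D"):
        raise ValueError("action must be '-A' or '-D'")
    # Validate the address; raises ValueError for anything that is not IPv4.
    ip = str(ipaddress.IPv4Address(etcd_ip))
    return [
        # Drop packets arriving from the etcd machine.
        ["iptables", action, "INPUT", "-s", ip, "-j", "DROP"],
        # Drop packets leaving for the etcd machine.
        ["iptables", action, "OUTPUT", "-d", ip, "-j", "DROP"],
    ]
```

Each returned list could be passed to `subprocess.run(rule, check=True)` as root, with the `"-D"` variant used in test teardown so the node regains connectivity.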

@antrik
Contributor

antrik commented Feb 26, 2016

So I think this needs some clarification on what exactly we want to test here.

First of all, AIUI this is about total loss of connectivity to the entire etcd cluster -- failover handling is an entirely different issue, right? As the functional tests currently all work with a single etcd instance, this should be more or less straightforward I guess...

Next question is about what exactly "loss of connectivity" means. Should we test the clear-cut case when we get an immediate error response, e.g. when the etcd process went down? Or the case where we don't get replies (but eventually a timeout I guess) when the server process or the network just hangs? Or both?

Also, what exactly do we want to test when connectivity is lost? Just that the state of existing units doesn't change on the disconnected node, or also how the rest of the cluster behaves? And if it's the latter, what do we actually expect there?... What about reconciliation once connectivity is re-established?

@themicster

So after reviewing #708 again, I think it can be made even simpler for a phase-one set of tests. Some tests need to be created that simulate flapping etcd in two ways: the etcd service on a machine starting and stopping, and the machine etcd runs on losing connectivity. But because of the pinning behavior, a test may fail sometimes and succeed other times with no change to the infrastructure.

@jonboulle
Contributor

> First of all, AIUI this is about total loss of connectivity to the entire etcd cluster -- failover handling is an entirely different issue, right?

Yes

> Next question is about what exactly "loss of connectivity" means. Should we test the clear-cut case when we get an immediate error response, e.g. when the etcd process went down?

Yes

> Or the case where we don't get replies (but eventually a timeout I guess) when the server process or the network just hangs?

This would be nice to have but isn't an immediate requirement. Let's start with loss and go from there.

> Also, what exactly do we want to test when connectivity is lost? Just that the state of existing units doesn't change on the disconnected node, or also how the rest of the cluster behaves?

Only the disconnected node.
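
The scope agreed on above (total loss, immediate errors, only the disconnected node) could be sketched roughly as follows. This is not fleet's functional test harness: `list_units`, `stop_etcd`, and `start_etcd` are hypothetical callables that a harness would have to supply, and the five-second outage is an arbitrary placeholder.

```python
import time


def units_changed_during_outage(list_units, stop_etcd, start_etcd,
                                outage_seconds=5.0, sleep=time.sleep):
    """Return the units whose state changed while etcd was unreachable.

    list_units() must return a mapping of unit name -> state as observed
    on the disconnected node. All three callables are assumptions here.
    An empty result means the node left its units alone.
    """
    before = dict(list_units())
    stop_etcd()
    try:
        # Give the agent time to notice the outage and (mis)react.
        sleep(outage_seconds)
        after = dict(list_units())
    finally:
        # Always restore etcd so later tests start from a clean state.
        start_etcd()
    return {
        name: (before.get(name), after.get(name))
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    }
```

A test would then assert that the returned dict is empty. Units that vanish or appear during the outage show up with `None` on one side of the tuple.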
