Add functional test coverage for Agent's loss of etcd connectivity #715
Comments
+1 |
+1. It turns out that to simulate etcd being down, as in the case where the etcd machine is dead, you have to drop packets to or from the IP that etcd should be running on. With a machine that is up, you'll normally receive an ICMP message telling you that the connection is refused, because the machine is not listening on that port. This results in the client instantly trying the next etcd server, which is great and causes no problem. I'm not sure how your test environment is set up, but in my case I have a Cumulus switch where I can add an iptables rule to prevent traffic to/from a server. I think you could also simulate this without losing complete connectivity to the machine by dropping all ICMP, or by finding the specific ICMP type that gets sent for "connection refused" and blocking that. Be sure you understand the actual packet flow here: it's been a while since I've read TCP/IP Illustrated, so I could be missing something. But I hope this helps. |
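To make the two failure modes in the comment above concrete, here is a minimal sketch of a probe that tells an immediate "connection refused" apart from a silent timeout. It is not fleet or etcd code; the function name `probe` and its `timeout` parameter are made up for illustration, and it only checks plain TCP reachability.

```python
import socket


def probe(host, port, timeout=1.0):
    """Classify a TCP connection attempt as 'ok', 'refused', or 'timeout'.

    Hypothetical helper for illustration; not part of fleet or etcd.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except ConnectionRefusedError:
        # The host answered but nothing is listening on the port,
        # so the attempt fails right away.
        return "refused"
    except socket.timeout:
        # No answer arrived within `timeout` seconds, e.g. because
        # packets to or from the host are being dropped.
        return "timeout"
```

A stopped etcd process on a live machine would be expected to show up as "refused", while a dead machine or a DROP rule in the path would be expected to show up as "timeout". Other errors, such as "no route to host", are deliberately left to propagate.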
I think this needs some clarification on what exactly we want to test here. First of all, AIUI this is about total loss of connectivity to the entire etcd cluster; failover handling is an entirely different issue, right? Since the functional tests currently all work with a single etcd instance, this should be more or less straightforward, I guess... The next question is what exactly "loss of connectivity" means. Should we test the clear-cut case where we get an immediate error response, e.g. when the etcd process went down? Or the case where we don't get replies (but eventually a timeout, I guess) because the server process or the network just hangs? Or both? Also, what exactly do we want to test when connectivity is lost? Just that the state of existing units doesn't change on the disconnected node, or also how the rest of the cluster behaves? And if it's the latter, what do we actually expect there?... What about reconciliation once connectivity is re-established? |
After reviewing #708 again, I think this can be made even simpler for a phase-one set of tests. We need tests that simulate flapping etcd in two ways: the etcd service on a machine starting and stopping, and the machine that etcd runs on losing connectivity. But because of the pinning behavior, the test may fail sometimes and succeed other times with no change to the infrastructure. |
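As a rough illustration of the first kind of flapping (the service starting and stopping on a machine that stays up), the sketch below toggles a plain TCP listener on a fixed port. It is only a stand-in: it does not speak the etcd protocol, and `FlappingListener` and `reachable` are hypothetical names, not part of fleet's functional test framework.

```python
import socket


class FlappingListener:
    """Hypothetical stand-in for an etcd endpoint that can be started
    and stopped repeatedly on the same port."""

    def __init__(self, host="127.0.0.1"):
        self.host = host
        self.port = None  # chosen by the OS on the first start()
        self._sock = None

    def start(self):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        # Allow rebinding the same port right after a stop().
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind((self.host, self.port or 0))
        sock.listen(8)
        self.port = sock.getsockname()[1]
        self._sock = sock

    def stop(self):
        if self._sock is not None:
            self._sock.close()
            self._sock = None


def reachable(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A test could call `start()` and `stop()` in a loop and check the agent's behavior after each transition. The second kind of flapping, where the whole machine loses connectivity, cannot be reproduced this way and would still need packets to be dropped somewhere in the network path.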
Yes
Yes
This would be nice to have but isn't an immediate requirement. Let's start with loss and go from there.
Only the disconnected node. |
Related #708
We need to add a functional test that ensures that the fleet agent behaves correctly when it loses access to the etcd cluster.