Vault suffering complete freeze after connection issues with the backend #3896
Comments
Ping to see if @michaelansel or @adamdecaf or @vespian can help or provide guidance here. |
Please have a look/test samuel/go-zookeeper#181 with samuel/go-zookeeper#186 applied. CC: @mhrabovcin |
Adding to milestone to pull deps if it works as a fix. |
Thanks! I will try that on Monday and come back to you.
|
I've reproduced my issue with 0.9.3 by dropping 80% of the packets from Zookeeper. In only a couple of minutes Vault freezes. I've built a new version of Vault including the fix. (The important part of the fix seems to be in samuel/go-zookeeper#181, which already lives in master.) |
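For anyone wanting to run a similar check, here is a minimal, hedged sketch (not the go-zookeeper fix itself) of how connection/session state from samuel/go-zookeeper can be observed while packet loss is injected, to see whether the patched client keeps re-establishing its session. The server address and timeout are placeholders.

```go
// Illustrative sketch only: log the session state transitions reported by
// samuel/go-zookeeper while the network is being disturbed. This is not the
// fix from samuel/go-zookeeper#181, just a way to watch its effect.
package main

import (
	"log"
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

func main() {
	// Placeholder address and session timeout.
	conn, events, err := zk.Connect([]string{"127.0.0.1:2181"}, 5*time.Second)
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer conn.Close()

	// The event channel reports transitions such as StateDisconnected,
	// StateConnecting, StateConnected and StateHasSession.
	for ev := range events {
		log.Printf("zk session event: type=%v state=%v server=%s", ev.Type, ev.State, ev.Server)
		if ev.State == zk.StateExpired {
			log.Println("session expired; watches and ephemeral nodes must be rebuilt")
		}
	}
}
```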
After applying the fix, Vault doesn't freeze anymore during network issues. However, after a certain amount of time it seals itself, making it effectively out of service. I've attached some new logs. Network issues start at 22:00:39 and Vault seals itself at 22:13:51. Here's the relevant part:
Is this "normal"? |
It looks like a connection to zk was lost and another was re-established, but not reused. Do you have the zk logs (with connection IDs)? I'm not that familiar with the zk -> vault backend; $work hasn't used it for a while now.
|
It's normal in the sense that if Vault is totally unable to even discover whether it is initialized and cannot read its keyring, it can't actually do anything. |
After more tests I'm able to reproduce this behavior and I've also discovered a new one. After simulating network issues between the Vault instances and the Zookeeper cluster for 45 minutes, I re-established a normal network, and now my 3 nodes are in standby with no master in the cluster. They are not sealed, but since there is no master the cluster is not usable.
All 3 instances now repeat the following error in the logs:
I've attached logs of the Vault instances and the Zookeeper instances. The Zookeeper version is 3.4.11. I've built my own version of Vault that includes the fix discussed above. Finally, here are the iptables rules used on each Vault server to simulate network issues:
vault01.txt |
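To confirm from the outside that no node has taken the active role, one option is to ask each node's unauthenticated sys/leader endpoint. Below is a hedged sketch using the official Vault Go API client; the node addresses are placeholders.

```go
// Hedged sketch: query each node's /v1/sys/leader endpoint to see whether any
// of them currently considers itself the active (leader) node.
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/vault/api"
)

func main() {
	// Placeholder node addresses.
	nodes := []string{
		"https://vault01.example.com:8200",
		"https://vault02.example.com:8200",
		"https://vault03.example.com:8200",
	}
	for _, addr := range nodes {
		cfg := api.DefaultConfig()
		cfg.Address = addr
		client, err := api.NewClient(cfg)
		if err != nil {
			log.Printf("%s: client error: %v", addr, err)
			continue
		}
		leader, err := client.Sys().Leader() // unauthenticated endpoint
		if err != nil {
			log.Printf("%s: leader query failed: %v", addr, err)
			continue
		}
		fmt.Printf("%s: ha_enabled=%v is_self=%v leader_address=%q\n",
			addr, leader.HAEnabled, leader.IsSelf, leader.LeaderAddress)
	}
}
```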
I've run the same experiment with 3 Vault instances using a Consul cluster as a backend. Same result. After less than 10 minutes, all 3 instances are sealed. The last messages are always the same:
Is this "normal" behavior? If yes, it bothers me, because 10 minutes is not a very long time, and having a service as central as Vault unable to "survive" network issues for 10 minutes seems problematic. Vault config:
iptables rules to simulate network issues:
10.105.200.XX are the Consul server nodes. I jam the connection between the Consul agent and the Consul servers, not between Vault and its local agent. |
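The exact rules used are in the attachment, but as a rough, hedged illustration of the kind of probabilistic drop being described (80% packet loss toward the Consul servers), a small helper like the following could install an equivalent rule via iptables' "statistic" match. The subnet and port are assumptions, not the author's actual values.

```go
// Illustration only (not the attached rules): install an iptables rule that
// randomly drops 80% of outgoing TCP packets toward a Consul server subnet.
package main

import (
	"log"
	"os/exec"
)

func main() {
	args := []string{
		"-A", "OUTPUT",
		"-p", "tcp",
		"-d", "10.105.200.0/24", // placeholder: Consul server subnet
		"--dport", "8300", // placeholder: Consul server RPC port
		"-m", "statistic", "--mode", "random", "--probability", "0.8",
		"-j", "DROP",
	}
	if out, err := exec.Command("iptables", args...).CombinedOutput(); err != nil {
		log.Fatalf("iptables failed: %v: %s", err, out)
	}
	log.Println("packet-loss rule installed; remove it later with the matching -D rule")
}
```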
Should I open a new ticket? |
HA nodes are reliant on locks provided by the storage layer. If those locks time out, they have to give up leadership. When a new node attempts to change to active state, one of the first things it does is verify that it has the latest keyring by reading it from storage. This is a safety feature: if the keyring was rotated and the node did not pick up on this (say, due to network issues), and new data was written with the new key versions on the formerly-active node, you risk a split-brain/data-loss scenario by continuing on with an in-memory cache of the keyring. When it is unable to successfully read its keyring, there isn't much Vault can safely do, so it seals itself. We will always prioritize data safety over administrative ease.

Like all benchmarks, yours are artificial. There are many failure situations where a Vault node is totally fine. For instance, in a full network outage affecting a standby node, it would never successfully complete grabbing the lock, so it would never try to become the active node, never attempt to read the keyring from storage, never encounter problems reading it from storage, and never seal itself. By doing a probabilistic dropping of packets -- something that in my experience occurs with far less frequency than a full outage -- you're hitting an uncommon failure case, and there isn't currently a known safe way to deal with it.

However, any critical infrastructure should have sufficient alerts on logs/metrics. If Vault seals itself, assuming your logs/metrics can make it off the box, operators should be able to react relatively quickly to a) fix the underlying problem and b) unseal Vault. |
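To make the sequence above concrete, here is a simplified sketch of the failover pattern being described. This is not Vault's actual code; all of the interfaces and the storage key are hypothetical stand-ins used only to show the lock-then-verify-keyring-or-seal flow.

```go
// Simplified sketch of the pattern described above (NOT Vault's real code):
// a standby wins the HA lock, then must re-read the keyring from storage
// before serving; if that read fails, it seals rather than trusting a
// possibly stale in-memory keyring.
package ha

import (
	"errors"
	"fmt"
)

// Lock is a hypothetical HA lock handed out by the storage backend
// (Zookeeper, Consul, ...). Lost() fires if the lock/session times out.
type Lock interface {
	Acquire() error
	Release() error
	Lost() <-chan struct{}
}

// Storage is a hypothetical view of the physical backend.
type Storage interface {
	Get(key string) ([]byte, error)
}

// Core is a hypothetical stand-in for the server's core state.
type Core struct {
	storage Storage
	sealed  bool
}

// becomeActive mirrors the sequence from the comment above: after winning the
// lock, re-read the keyring from storage so a stale in-memory keyring can
// never be used to write new data (avoiding split-brain / data loss).
func (c *Core) becomeActive(lock Lock) error {
	if err := lock.Acquire(); err != nil {
		return fmt.Errorf("could not obtain HA lock: %w", err)
	}
	keyring, err := c.storage.Get("core/keyring") // hypothetical key name
	if err != nil || len(keyring) == 0 {
		_ = lock.Release()
		c.sealed = true // safety first: refuse to run on an unverified keyring
		return errors.New("unable to verify keyring from storage; sealing")
	}
	// ... decrypt keyring, switch to active mode, watch lock.Lost() to step down ...
	return nil
}
```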
This is exactly what I'd expect from a service like Vault: protection over reliability. I don't see a problem with flaky or down networks causing Vault to eventually seal; in fact, I'd call that a feature! |
I totally agree concerning reliability over management ease. My concern was more about the duration: in my simulation, my entire Vault cluster sealed itself after less than 10 minutes of disruption, which seems short to me. Is there a way to define a timeout or a keyring retry period so that Vault is able to survive a bit longer? |
When you say "a bit longer", what exactly is the length of time that you think is appropriate? 11 minutes? 20? An hour? When Vault seals itself you get an immediate notification that something is wrong, subject to appropriate monitoring.

Vault could sit in a spin loop trying over and over to read its keyring. Depending on what the actual problem is, which at Vault's level it may not be privy to (for instance, massive amounts of packet loss and TCP retransmissions, which are simply surfaced as intermittent errors on network calls), this may never succeed, and you may end up waiting on Vault for an hour to try to recover from an error that it will never recover from, giving you more downtime than if you got an alert immediately due to Vault sealing and were able to fix the problem and unseal. Vault will be unusable in these situations anyway. You're better off knowing about them as soon as possible so you can fix the underlying problem.

You are running a synthetic test to simulate failure conditions that are, in my experience, much less common in the real world than failure conditions that Vault generally can successfully recover from, with an arbitrary idea of how Vault should behave in these conditions. I don't really think I can give you any answer that will make you happy here. All I can say is what I said before: we will always prioritize safety over uptime, and just like with any other piece of critical infrastructure, you should have monitoring set up that can alert you to a catastrophic failure state immediately. |
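For the monitoring side, here is a hedged sketch of the kind of check being suggested: poll the unauthenticated seal-status endpoint with the official Vault Go API client and raise an alert as soon as a node reports itself sealed. The address, polling interval and "alert" (a log line) are placeholder assumptions; a real setup would feed a metrics/alerting system.

```go
// Hedged monitoring sketch: poll /v1/sys/seal-status and complain loudly when
// the node is sealed or unreachable, so operators can react and unseal quickly.
package main

import (
	"log"
	"time"

	"github.com/hashicorp/vault/api"
)

func main() {
	cfg := api.DefaultConfig()
	cfg.Address = "https://vault01.example.com:8200" // placeholder
	client, err := api.NewClient(cfg)
	if err != nil {
		log.Fatalf("vault client: %v", err)
	}

	for range time.Tick(30 * time.Second) { // placeholder interval
		status, err := client.Sys().SealStatus()
		if err != nil {
			log.Printf("ALERT: cannot reach Vault: %v", err)
			continue
		}
		if status.Sealed {
			log.Printf("ALERT: Vault is sealed (unseal progress %d/%d)", status.Progress, status.T)
		}
	}
}
```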
We have multiple Vault instances (in different clusters) using Zookeeper as a backend that suffer complete freezes after some network issues. Symptoms are always the same: Vault doesn't respond anymore and requests (using the CLI or curl) are stuck for dozens of seconds. A netstat indicates that Vault has no connection to Zookeeper anymore.
The issue always begins during episodes of network instability. Vault suffers multiple connection failures with Zookeeper and at one point, for no apparent reason, it seems to give up entirely, leaving the instance without any connection to its backend.
Below is information about one (dev) instance. I've also attached logs (from the beginning of the network issues until we restarted the instance).
Finally, on the same server we have a small agent (a sidecar) for registering Vault in our service discovery system. This agent uses Zookeeper for registration (the same instance as Vault) and also uses the same Go library for the Zookeeper connection (https://github.com/samuel/go-zookeeper). I've also attached the logs of this agent.
What's interesting is that the agent also suffers connection failures with Zookeeper but eventually recovers and continues as usual once the network is stable again.
vault-logs.txt
agent-logs.txt
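For context, a registration sidecar of the kind described above typically keeps an ephemeral znode alive and re-creates it whenever a new session is established, which is why it can ride out the flapping seen in the logs. Below is a rough, hedged sketch of that pattern; the paths, addresses and data are placeholders, and this is not the author's agent.

```go
// Hedged sidecar sketch: maintain an ephemeral registration znode and
// re-register whenever samuel/go-zookeeper reports a (re)established session.
package main

import (
	"log"
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

func register(conn *zk.Conn) {
	// Ephemeral nodes vanish when the session dies, so they must be re-created
	// after a new session is established. Path and payload are placeholders.
	_, err := conn.Create("/services/vault/vault01", []byte("10.0.0.1:8200"),
		zk.FlagEphemeral, zk.WorldACL(zk.PermAll))
	if err != nil && err != zk.ErrNodeExists {
		log.Printf("register failed: %v", err)
	}
}

func main() {
	conn, events, err := zk.Connect([]string{"127.0.0.1:2181"}, 5*time.Second)
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer conn.Close()

	for ev := range events {
		// StateHasSession arrives on the initial connect and again after
		// reconnects; if the node still exists, Create simply returns
		// ErrNodeExists, which register() ignores.
		if ev.State == zk.StateHasSession {
			register(conn)
		}
	}
}
```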
Environment:
Vault Config File: