Added HA backend for postgres based on dynamodb model #5731
Conversation
Edit: no longer an issue.
@bjorndolk I don't really know anything about Postgres so I can't review it. I do know that the Dynamo implementation is pretty janky and a lot of people have had issues with it -- it's the reason ha_enabled is off by default -- so I'm not sure basing an implementation on that model is the right move.
@jefferai It was the implementation I liked most. One difference is that this implementation uses a central clock instead of several possibly out-of-sync clocks. I can imagine the DynamoDB platform having latency issues which may cause problems; a Postgres database will typically be responsive. Can you please help answer the question of whether or not the forwarding of calls from standbys to the master is expected to work when only the basics of the HA backend are implemented? I tried to dig into the vault code to find this out myself, but I got lost. I suspect an implementation of the ServiceDiscovery interface may be needed for this to work. For us, we don't need forwarded calls to work; we can solve this with load balancer config.
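(For readers following along: the DynamoDB model ports to Postgres as a single lock row plus an upsert that succeeds only when the existing lock has expired according to the database server's own clock -- the central clock mentioned above. The sketch below illustrates the idea with a hypothetical table and schema, `vault_ha_lock`, not necessarily the one this PR uses.)

```go
package postgresha

import (
	"database/sql"

	_ "github.com/lib/pq" // Postgres driver, registered with database/sql
)

// Illustrative only: table and column names are hypothetical. The key idea
// is that lock expiry is evaluated against the *database server's* clock
// (now()), giving one central time source instead of several possibly
// out-of-sync node clocks.
const tryLockSQL = `
INSERT INTO vault_ha_lock (ha_key, ha_identity, valid_until)
VALUES ($1, $2, now() + $3::interval)
ON CONFLICT (ha_key) DO UPDATE
SET ha_identity = EXCLUDED.ha_identity,
    valid_until = EXCLUDED.valid_until
WHERE vault_ha_lock.valid_until < now()                 -- steal expired locks
   OR vault_ha_lock.ha_identity = EXCLUDED.ha_identity  -- or renew our own
`

// tryWriteLock reports whether exactly one row was written, i.e. whether
// this instance now holds (or has renewed) the lock.
func tryWriteLock(db *sql.DB, key, identity, ttl string) (bool, error) {
	res, err := db.Exec(tryLockSQL, key, identity, ttl)
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	if err != nil {
		return false, err
	}
	return n == 1, nil
}
```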
I browsed dynamodb issues:
and modified code to set up HA tables for tests
Merged with #5926 to have a test that uses a Docker-backed Postgres database as part of testing.
I have done some more testing and scripted the setup for creating a Docker database and three competing Vault instances running Postgres HA. The script does not verify anything; it is only a help for setting up an environment to play around in: killing and starting/unsealing Vault instances and making sure the master is moved correctly.
…exists, thereby fixing the problem with the missing Active Node address for passive nodes.
@jefferai I have now sorted out the problem with forwarded calls to passive nodes as discussed above. I had misunderstood the meaning of what a "held" lock is; anyway, it is sorted now. As you can see in the history, this also comes with a Docker test setup now. Also, I don't think there is a need for deeper Postgres knowledge to review this: the interaction with the database is, in my mind, simple and easy to comprehend.
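(For context: in Vault's physical package the HA contract looks roughly like the following -- paraphrased, not verbatim; check the vault source for the authoritative version. The key point for the fix above is that Lock.Value() reports whether the lock is held by anyone, not whether it is held by you, and its value carries the active node's address that standbys use for forwarding.)

```go
package postgresha

// HABackend is implemented by storage backends that support HA.
type HABackend interface {
	// LockWith is used for mutual exclusion based on the given key. The
	// value is stored with the lock and is what standbys read back --
	// typically the active node's advertised address.
	LockWith(key, value string) (Lock, error)

	// HAEnabled reports whether HA functionality is enabled.
	HAEnabled() bool
}

// Lock is the lock handle returned by LockWith.
type Lock interface {
	// Lock blocks until the lock is acquired or stopCh is closed. The
	// returned channel is closed if the lock is subsequently lost.
	Lock(stopCh <-chan struct{}) (<-chan struct{}, error)

	// Unlock releases the lock.
	Unlock() error

	// Value returns whether the lock is held by *anyone* and, if so, the
	// value stored with it -- a "held" lock does not mean "held by me".
	Value() (bool, string, error)
}
```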
Hi @bjorndolk, thank you for being patient with us. I'm going to help you get this merged. Bear with me, please, because I'm still coming up to speed on Vault HA backends. I'm going to provide some initial thoughts on your patch in isolation, then spend some time digging into other HA backends, at which point I'll likely have some further thoughts.
@ncabatoff, thanks for not letting my work go to waste. I will dig into your comments, hopefully this week or next.
…mode. Add PostgreSQLLock.renewTicker to manage the lock renewal lifecycle. Add PostgreSQLLock fields ttl, renewInterval, and retryInterval: this allows modifying the default values in tests. In Lock(), use the stopCh argument to control goroutine termination instead of making our own. Remove the unused done channel. tryToLock/writeItem() weren't doing quite what the dynamodb versions that inspired them did: in the dynamodb version, when a lock wasn't obtained because it was held by another instance, the lock attempt would be retried. In this version we weren't doing that, because any failure to grab the lock, for whatever reason, was treated as an error and resulted in a write to the errors chan. To address this, writeItem now also returns a bool indicating whether it wrote exactly one row, i.e. whether the upsert succeeded. When (false, nil) is returned, that means no error occurred but also that no lock was obtained, so tryToLock will retry. periodicallyRenewLock() exits when it doesn't lock successfully, and closes its done channel. This obviates the need for watch(), which has been removed. Unlock() stops the ticker used in periodicallyRenewLock, which saves us from falling victim to the same problem as hashicorp#5828. Added testPostgreSQLLockRenewal, which attempts to duplicate in spirit the test from hashicorp#6512, and testPostgresSQLLockTTL, which is the exact equivalent of the corresponding DynamoDB test.
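(The retry behaviour described in this commit message boils down to a loop like the sketch below. The names tryToLock, writeItem, and retryInterval follow the commit message; the scaffolding around them is illustrative only.)

```go
package postgresha

import "time"

// Trimmed-down stand-in for the lock type described in the commit message;
// only the pieces this sketch needs are shown, and writeItem is stubbed as
// a function field to keep the example self-contained.
type PostgreSQLLock struct {
	retryInterval time.Duration
	writeItem     func() (bool, error) // the upsert; see the earlier sketch
}

// tryToLock retries the upsert until it succeeds, stop is closed, or a real
// error occurs. A (false, nil) result from writeItem means the lock is held
// by another instance -- not an error -- so we retry instead of giving up.
func (l *PostgreSQLLock) tryToLock(stop <-chan struct{}, success chan struct{}, errs chan error) {
	ticker := time.NewTicker(l.retryInterval)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			gotLock, err := l.writeItem()
			switch {
			case err != nil:
				errs <- err
				return
			case gotLock:
				close(success)
				return
			}
			// (false, nil): held elsewhere; loop and retry.
		}
	}
}
```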
postgres ha support review fixes and tests
@ncabatoff I have merged, without changes, your PR addressing your own review. Thanks for helping out on this.
@ncabatoff, since you are reviewing physical backends more generally, it would be nice if...
I have successfully tested this: started 3 instances and made sure the active node is moved around while killing the master. Used the attached setup.
Looks pretty good! Just a few questions plus some small cleanup suggestions.
@ncabatoff I'll look into this tomorrow.
Co-Authored-By: bjorndolk <44286676+bjorndolk@users.noreply.github.com>
Space in comment Co-Authored-By: bjorndolk <44286676+bjorndolk@users.noreply.github.com>
Space in comment Co-Authored-By: bjorndolk <44286676+bjorndolk@users.noreply.github.com>
Space in comment Co-Authored-By: bjorndolk <44286676+bjorndolk@users.noreply.github.com>
space in comment Co-Authored-By: bjorndolk <44286676+bjorndolk@users.noreply.github.com>
Accepting ncabatoff changes. Addressing review comments from kalafut. General cleanup + handling of the no-lock-found case.
@ncabatoff I have merged your PR without changes.
Thanks @bjorndolk, we're grateful that you did the first 90% of the work!
@ncabatoff @kalafut Thanks for helping out. I think your input and work have simplified things and generally greatly improved the quality; I think it's in good shape. The one thing I can be critical of now is that the timing values guarding locks should possibly be soft-configurable: maybe someone would like to decrease them to one second or less so as to minimize failover time. But that is a nice-to-have rather than a must-have.
I was keen to see what this would look like in practice, so I created a docker-compose setup to run Postgres, two Vault instances with suicidal tendencies and a third Vault instance to allow them to auto-unseal, Consul to discover which Vaults are healthy, a client to generate load on them, and Prometheus/Grafana/consul-exporter to visualize all that. If you want to give it a try see https://github.com/ncabatoff/visualize-vault-ha .
@ncabatoff That looks really cool; I wish I had more time to check it out. I did run into problems satisfying the golang 1.11 requirement. I hope to get some time to resolve that and try your stuff.
…nto postgres-ha-support
This is a clone of the excellent DynamoDB implementation, with the database specifics replaced with Postgres.
There is another related Postgres PR, #2700. However, that implementation does not pass the generic HA tests, and I believe its lock model is inferior to the one implemented in DynamoDB.
Please consider this contribution a substitute for #2700.