
Added HA backend for postgres based on dynamodb model #5731

Merged
26 commits merged on May 10, 2019

Conversation

bjorndolk
Contributor

This is a clone of the excellent dynamodb implementation, with the database specifics replaced with postgres.

There is another related postgres PR #2700. However, that implementation does not pass the generic HA tests, and I believe its lock model is inferior to the one implemented in dynamodb.

Please consider this contribution as a substitute for #2700.

@bjorndolk
Contributor Author

bjorndolk commented Nov 8, 2018

Edit: no longer an issue.
Reviewers should probably take a peek at this thread https://groups.google.com/forum/#!topic/vault-tool/1Sx2aedKt94

@jefferai
Member

jefferai commented Nov 8, 2018

@bjorndolk I don't really know anything about Postgres so I can't review it. I do know that the Dynamo implementation is pretty jank and a lot of people have had issues with it -- it's the reason ha_enabled is off by default -- so I'm not sure basing an implementation on that model is the right move.

@bjorndolk
Contributor Author

bjorndolk commented Nov 9, 2018

@jefferai It was the implementation I liked most. One difference is that this implementation uses a central clock instead of several possibly out-of-sync clocks. I can imagine the dynamodb platform having latency issues which may cause problems; a postgres database will typically be responsive.

Can you please help answer the question of whether request forwarding from standbys to the master is expected to work when only implementing the basics of the HA backend? I tried to dig into the vault code to find this out myself, but I got lost. I suspect an implementation of the ServiceDiscovery interface may be needed for this to work. For us, we don't need forwarded calls to work; we can solve this with load balancer config.
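
For context, the "basics" of an HA backend are roughly the two interfaces below from Vault's physical package. This is paraphrased, so treat the exact comments and signatures as approximate rather than a verbatim copy of the source at the time of this PR.

```go
// Paraphrased sketch of Vault's HA interfaces in the physical package;
// approximate, not a verbatim copy.
package physical

// HABackend is implemented by storage backends that can provide HA locking.
type HABackend interface {
	// LockWith is used for mutual exclusion based on the given key.
	LockWith(key, value string) (Lock, error)

	// HAEnabled reports whether HA functionality is enabled.
	HAEnabled() bool
}

// Lock is the lock returned by LockWith.
type Lock interface {
	// Lock acquires the lock. The optional stopCh interrupts the attempt;
	// the returned channel is closed when leadership is lost.
	Lock(stopCh <-chan struct{}) (<-chan struct{}, error)

	// Unlock releases the lock.
	Unlock() error

	// Value returns whether the lock is held and, if so, the value stored
	// with it (typically the active node's address).
	Value() (bool, string, error)
}
```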

@bjorndolk
Contributor Author

bjorndolk commented Nov 9, 2018

I browsed the dynamodb issues:
#5077
This implementation separates HA data from regular data :). Dynamodb stores the ha_keyvalue in the key-value store, rather than in a dedicated HA key store as this implementation does. I chose to do so since it's easier to do transactions on one table rather than two.
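
To make that concrete, here is a minimal sketch of what a dedicated HA lock table and a lock upsert driven by the database's own clock could look like. The table name, column names, and connection string are illustrative assumptions, not necessarily what this PR actually uses.

```go
// Illustrative sketch only: table and column names are assumptions. The point
// is that HA lock state lives in its own table and that expiry is judged by
// the database's clock (now()), so every node sees the same time source.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq"
)

const createLockTable = `
CREATE TABLE IF NOT EXISTS vault_ha_locks (
    ha_key      TEXT                     NOT NULL PRIMARY KEY,
    ha_identity TEXT                     NOT NULL,
    ha_value    TEXT,
    valid_until TIMESTAMP WITH TIME ZONE NOT NULL
)`

// The upsert only takes effect when the row is new, already ours, or expired
// according to the server clock, so exactly one affected row means "we hold
// the lock".
const tryLockSQL = `
INSERT INTO vault_ha_locks (ha_key, ha_identity, ha_value, valid_until)
VALUES ($1, $2, $3, now() + make_interval(secs => $4))
ON CONFLICT (ha_key) DO UPDATE
SET ha_identity = EXCLUDED.ha_identity,
    ha_value    = EXCLUDED.ha_value,
    valid_until = EXCLUDED.valid_until
WHERE vault_ha_locks.ha_identity = EXCLUDED.ha_identity
   OR vault_ha_locks.valid_until < now()`

func tryLock(db *sql.DB, key, identity, value string, ttl time.Duration) (bool, error) {
	res, err := db.Exec(tryLockSQL, key, identity, value, ttl.Seconds())
	if err != nil {
		return false, err
	}
	n, err := res.RowsAffected()
	return n == 1, err
}

func main() {
	db, err := sql.Open("postgres", "postgres://vault:vault@localhost/vault?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	if _, err := db.Exec(createLockTable); err != nil {
		log.Fatal(err)
	}
	held, err := tryLock(db, "core/lock", "node-a", "https://active.vault:8200", 15*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("lock held:", held)
}
```

The single-table upsert means acquiring, renewing, and taking over an expired lock are all one atomic statement, which is the transactional simplicity the comment above refers to.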

physical/postgresql/postgresql_test.go — review comment (resolved)
@bjorndolk
Contributor Author

Merged with #5926 to have a test that uses a Docker-based postgres database as part of testing.

@bjorndolk
Contributor Author

bjorndolk commented Jan 10, 2019

I have done some more testing and scripted the setup for creating a docker database and three competing vault instances running postgres HA. The script does not verify anything; it is only a help to set up an environment to play around in, killing and starting/unsealing vault instances and making sure the active node moves correctly.
It would be nice if this type of test were part of the standard backend storage unit tests in a generic form, so that all backends could be tested. Hopefully someone will provide that in the future.

vault-pg-ha-test.txt

…exists, hereby fixing the problem with missing Active Node address for passive nodes.
@bjorndolk
Contributor Author

bjorndolk commented Jan 10, 2019

@jefferai I have now sorted out the problem with forwarding calls from passive nodes, as discussed above. I had misunderstood the meaning of what a "held" lock is; anyway, it is sorted now. As you can see in the history, this also comes with a docker test setup.

Also, I don't think there is a need for deeper postgres knowledge to review this. The interaction with the database is, in my mind, simple and easy to comprehend.

@ncabatoff
Collaborator

Hi @bjorndolk,

Thank you for being patient with us. I'm going to help you get this merged. Bear with me please because I'm still coming up to speed with vault HA backends. I'm going to provide some initial thoughts on your patch in isolation, then spend some time digging in to other HA backends, at which point I'll likely have some further thoughts.

physical/postgresql/postgresql.go — multiple review comments (resolved)
@bjorndolk
Contributor Author

@ncabatoff, thanks for not letting my work go to waste. I will dig into your comments, hopefully this week or next.

…mode.

Add PostgreSQLLock.renewTicker to manage the lock renewal lifecycle.

Add PostgreSQLLock fields ttl, renewInterval, and retryInterval: this
allows modifying the default values in tests.

In Lock(), use the stopCh argument to control goroutine termination
instead of making our own.  Remove unused done channel.

tryToLock/writeItem() weren't doing quite what the dynamodb versions that
inspired them did: in the dynamodb version, when a lock wasn't obtained
because it was held by another instance, the lock attempt would be
retried.  In this version we weren't doing that because any failure to
grab the lock for whatever reason was treated as an error and resulted in
a write to the errors chan.

To address this writeItem now also returns a bool indicating whether it
wrote exactly 1 row, indicating that the upsert succeeded.  When
(false, nil) is returned that means no error occurred but also that no
lock was obtained, so tryToLock will retry.

periodicallyRenewLock() exits when it doesn't lock successfully, and
closes its done channel.  This obviates the need for watch(), which has
been removed.

Unlock() stops the ticker used in periodicallyRenewLock, which
saves us from falling victim to the same problem as hashicorp#5828.

Added testPostgreSQLLockRenewal which attempts to duplicate in spirit
the test from hashicorp#6512, and testPostgresSQLLockTTL which is the exact
equivalent of the corresponding DynamoDB test.
postgres ha support review fixes and tests
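
A rough sketch of the retry and renewal flow this commit message describes is below. The field and method names follow the message, but the code is an approximation under assumptions (writeItem is injected as a plain function so the sketch compiles without repeating the SQL), not the PR's actual implementation.

```go
// Sketch of the retry/renewal flow described in the commit message above;
// approximate, not the PR's actual code.
package postgresha

import "time"

type PostgreSQLLock struct {
	// writeItem performs the lock upsert: (true, nil) means exactly one row
	// was written and we hold the lock; (false, nil) means no error occurred
	// but another instance still holds it.
	writeItem     func() (bool, error)
	renewInterval time.Duration
	retryInterval time.Duration
	renewTicker   *time.Ticker
	stopRenew     chan struct{} // closed by Unlock so the renew goroutine exits
}

func newPostgreSQLLock(writeItem func() (bool, error), renew, retry time.Duration) *PostgreSQLLock {
	return &PostgreSQLLock{
		writeItem:     writeItem,
		renewInterval: renew,
		retryInterval: retry,
		stopRenew:     make(chan struct{}),
	}
}

// tryToLock retries until the lock is obtained, stopCh is closed, or a real
// error occurs. A lock held elsewhere is not an error; it just means retry.
func (l *PostgreSQLLock) tryToLock(stopCh <-chan struct{}, success chan struct{}, errs chan error) {
	ticker := time.NewTicker(l.retryInterval)
	defer ticker.Stop()
	for {
		select {
		case <-stopCh:
			return
		case <-ticker.C:
			gotIt, err := l.writeItem()
			if err != nil {
				errs <- err
				return
			}
			if gotIt {
				close(success)
				return
			}
		}
	}
}

// periodicallyRenewLock keeps re-upserting the row while we are active and
// closes done when renewal fails, signalling lost leadership.
func (l *PostgreSQLLock) periodicallyRenewLock(done chan struct{}) {
	l.renewTicker = time.NewTicker(l.renewInterval)
	defer l.renewTicker.Stop()
	for {
		select {
		case <-l.stopRenew:
			return
		case <-l.renewTicker.C:
			if renewed, err := l.writeItem(); err != nil || !renewed {
				close(done)
				return
			}
		}
	}
}

// Unlock stops the renewal ticker and signals the renew goroutine to exit
// (cf. the leaked-ticker problem in hashicorp#5828 mentioned above).
// Releasing the lock row itself is omitted from this sketch.
func (l *PostgreSQLLock) Unlock() error {
	if l.renewTicker != nil {
		l.renewTicker.Stop()
	}
	close(l.stopRenew)
	return nil
}
```
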
@bjorndolk
Contributor Author

@ncabatoff I have merged your PR addressing your own review, without changes. Thanks for helping out on this.

@bjorndolk
Contributor Author

bjorndolk commented Apr 17, 2019

@ncabatoff, since you are reviewing physical backends more generally, it would be nice if...
*There was an implementation of the shared logic by hashicorp, so a backend could be developed by only implementing the database-specific operations, instead of copy-pasting another backend's logic as done here.
*There were better generic backend tests that actually started several running nodes and killed/restarted them. I bet if those existed, some very serious bugs would be found in some existing backends.

@bjorndolk
Contributor Author

I have successfully tested starting 3 instances and made sure the active node is moved around while killing the master. I used the attached setup.
ha-test-setup.txt

Contributor

@kalafut kalafut left a comment


Looks pretty good! Just a few questions plus some small cleanup suggestions.

@bjorndolk
Contributor Author

@ncabatoff I'll look into this tomorrow.

ncabatoff and others added 2 commits April 25, 2019 15:43
Co-Authored-By: bjorndolk <44286676+bjorndolk@users.noreply.github.com>
kalafut and others added 6 commits April 26, 2019 07:15
Space in comment

Co-Authored-By: bjorndolk <44286676+bjorndolk@users.noreply.github.com>
Space in comment

Co-Authored-By: bjorndolk <44286676+bjorndolk@users.noreply.github.com>
Space in comment

Co-Authored-By: bjorndolk <44286676+bjorndolk@users.noreply.github.com>
space in comment

Co-Authored-By: bjorndolk <44286676+bjorndolk@users.noreply.github.com>
Accepting ncabatoff changes Addressing review comments from kalafut. General cleanup + handling of no lock found case.
@bjorndolk
Contributor Author

bjorndolk commented Apr 26, 2019

@ncabatoff I have merged your PR without changes.
I granted you access to my fork. I put my full trust in you; you are currently the one doing the work, and I am grateful for it.

@ncabatoff
Collaborator

@ncabatoff I have merged your PR without change
I granted you access to my fork. I put my full trust in you, You are currently the one doing the work, I am grateful for it.

Thanks @bjorndolk, we're grateful that you did the first 90% of the work!

@bjorndolk
Contributor Author

@ncabatoff @kalafut Thanks for helping out; I think your input and work have simplified things and greatly improved the quality. I think it's in good shape. The one thing I can be critical of now is that the timing values guarding the locks should possibly be soft-configurable. Maybe someone would like to decrease them to one second or less to minimize failover time. But that is a nice-to-have rather than a must-have.

@ncabatoff
Collaborator

ncabatoff commented Apr 28, 2019

I was keen to see what this would look like in practice, so I created a docker-compose setup to run Postgres, two Vault instances with suicidal tendencies and a third Vault instance to allow them to auto-unseal, Consul to discover which Vaults are healthy, a client to generate load on them, and Prometheus/Grafana/consul-exporter to visualize all that. If you want to give it a try see https://github.com/ncabatoff/visualize-vault-ha .

@bjorndolk
Contributor Author

@ncabatoff That looks really cool, I wish I had more time to check that out. I did run into problems satisfying the golang 1.11 requirement. I hope to get some time to resolve that and try your stuff.


5 participants