Documentation not clear on how to clone from S3 backup #858
> The S3 bucket must contain a backup and WALs made by wal-e or wal-g.

What do you mean by "backup"? Files produced by a logical backup or something else? How do I set up wal-e or wal-g using the Zalando operator?
@walec51 have a look at the environment variables of Spilo. You can overwrite them with an additional configmap which you can reference in the operator configuration. Some things like the S3 bucket or the IAM role can already be set in the operator configuration.
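Roughly, a minimal sketch of the relevant operator configuration, assuming the ConfigMap-based operator configuration and a pod environment ConfigMap called postgres-pod-config (the names and values here are placeholders, not taken from this thread):

```yaml
# Sketch: operator ConfigMap excerpt (names and values are illustrative placeholders)
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-operator
data:
  # S3 bucket and IAM role can be set directly in the operator configuration
  wal_s3_bucket: my-wal-backup-bucket
  kube_iam_role: my-spilo-iam-role
  # ConfigMap whose keys are injected as env vars into every Spilo pod,
  # useful for overriding Spilo settings not exposed by the operator
  pod_environment_configmap: postgres-pod-config
```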
This issue or discussion seems to be related to #630 ... @FxKu we just dove into this when trying to configure Spilo (or rather WAL-E) to use credentials from the secret referenced by additional_secret_mount. We like to call this "bring your own bucket" (and AWS credentials). Since those are credentials, a Kubernetes secret is the right resource to use here. But you can only provide a single secret and a single mount point (see postgres-operator/pkg/cluster/k8sres.go, line 1197 at 9ddee8f), and you cannot use its contents as environment variables. Looking at Spilo, or rather its configuration script at https://github.com/zalando/spilo/blob/master/postgres-appliance/scripts/configure_spilo.py#L717, it only works on environment variables to decide which mode to use (instance role vs. specific IAM credentials). Mounting a secret with the relevant keys does not help, and neither does placing an AWS credentials file at the expected path. So in essence the secret mount, even though documented as a way to provide (Cloud SDK) credentials (https://github.com/zalando/postgres-operator/blob/master/docs/administrator.md#access-to-cloud-resources-from-clusters-in-non-cloud-environment), does not work to provide them to WAL-E, or in any way that fits Spilo's configuration concept.
Experimenting with a few different pg-on-k8s solutions at the moment, and the postgres-operator/Spilo/Patroni documentation is the most confusing. I've been reading for a solid hour, and I'm still unclear how to initialize a fresh database from a wal-g S3 backup. As best I can tell, it requires a custom configmap to override some Spilo env variables which themselves are not documented. I appreciate Zalando open-sourcing this, but there should be a foolproof example of how to get a wal-g backup initialized.
@qeternity the functionality is called "clone" - check out: https://github.com/zalando/postgres-operator/blob/master/docs/user.md#how-to-clone-an-existing-postgresql-cluster There you can "clone" a cluster from a "dead" S3 backup (pg_basebackups + shipped WAL files) which is created and managed by WAL-E.
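As a rough sketch, the clone section of a postgresql manifest for such an S3-based clone could look like this (cluster name, uid and timestamp are placeholders; if I read the docs correctly, providing a timestamp is what makes the operator restore from the S3 WAL archive instead of basebackup-ing a running cluster):

```yaml
# Sketch: postgresql manifest excerpt cloning from a WAL-E/WAL-G S3 backup (placeholder values)
spec:
  clone:
    cluster: acid-source-cluster                 # name of the (possibly deleted) source cluster
    uid: "efd12e58-0000-0000-0000-000000000000"  # uid of the source cluster, part of the S3 path
    timestamp: "2020-01-01T12:00:00+00:00"       # point in time to recover to
```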
@qeternity, I have the same experience. We have tested and reviewed Crunchy, KubeDB and Zalando for Postgres operations on k8s. The Zalando operator (with Spilo, Patroni, wal-e/wal-g) is great, but the user, operations and system documentation is very confusing. We had to dig around in the source code of the various components to understand how everything is linked together.

Summary (also written at the bottom of this post), for those of you who don't bother reading my complete "comment" ;) The learning curve using the Zalando Postgres Operator was steep, but in the end, once we understood the components and how they work, I must say that the operator just works fantastically.

TL;DR After understanding that the backup referred to in https://github.com/zalando/postgres-operator/blob/master/docs/user.md#how-to-clone-an-existing-postgresql-cluster is actually the WAL backup and not the logical backup, things became a little clearer.

To get WAL "backups" to work we had to do a lot of research. The k8s cluster that we used for testing runs in eu-north-1, and the backups (as in logical backups) and the WAL backups were also created in this region. wal-e is written in Python and relies on boto (not boto3). Boto is old and does not support eu-north-1, and it took quite some effort to track the error down to this fact. The k8s cluster also has kiam installed, so we initially looked for problems there. At the same time that the wal-e upload to S3 failed, the logical backup to S3 in the same region worked. The logical backups are carried out by a dedicated cronjob pod, using the aws s3 CLI to upload the pg_dumpall backups, so they are entirely unrelated to Spilo.

wal-g is written in Go and works with eu-north-1. However, even though Spilo turns out to work with wal-g, the Zalando Postgres Operator supposedly didn't (at least it's not mentioned in many places). The solution for us was to add a set of environment variables to all pods, using the operator's ConfigMap-based feature for injecting environment variables, roughly like the sketch below.
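The exact variable names below are Spilo settings; treat the values and the ConfigMap name as placeholders for your own setup:

```yaml
# Sketch: pod environment ConfigMap injected into all Spilo pods (placeholder values)
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-pod-config
data:
  AWS_REGION: eu-north-1
  USE_WALG_BACKUP: "true"     # use wal-g instead of wal-e for base backups / WAL archiving
  USE_WALG_RESTORE: "true"    # also use wal-g when restoring / cloning
  BACKUP_SCHEDULE: "00 01 * * *"
```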
Unfortunately the ConfigMap has to exist in the same namespace as the Postgres cluster being created, which made the user experience for non-cluster-administrators less than optimal. Luckily there is a PR merged into master addressing this issue, #870, so we built our own Docker image of the Zalando operator from master.

The next issue we encountered was logging. We are using Grafana Loki (https://grafana.com/oss/loki/) for all cluster log management. This relies on containers in our cluster simply emitting logs to stdout and stderr, and Loki promtail does the rest. Spilo by default configures Postgres to log to a set of CSV files, which is not optimal for our use case. We managed to turn off the logging_collector and configure our own logging preferences by defining config overrides in the postgresql manifest, roughly like the sketch below.
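The parameter values here are examples, not recommendations:

```yaml
# Sketch: postgresql manifest excerpt with Postgres parameter overrides (example values)
spec:
  postgresql:
    version: "12"
    parameters:
      logging_collector: "off"       # stop writing CSV log files
      log_destination: "stderr"      # picked up from the container output by promtail
      tcp_keepalives_idle: "300"     # see the NLB discussion below
      tcp_keepalives_interval: "60"
      tcp_keepalives_count: "5"
```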
The tcp_keepalives_* settings are in there because we are using an AWS NLB, not an AWS ELB, in front of the cluster services. We have some legacy clients that are not able to send keepalives themselves, and some have very long open transactions to the database, lasting more than 3600 seconds, which is the maximum timeout of the traditional ELB. The timeout of the NLB, on the other hand, is not configurable, so any connection that is not kept open by pings or real requests is terminated by the NLB after 350 seconds. Getting the operator to create an internal-facing AWS NLB instead of an ELB is not documented, but we managed to get that going by adding some extra config to the operator (sketched below)
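The custom_service_annotations format and the exact NLB annotation keys below are assumptions on my part, so adjust them to your operator and Kubernetes versions:

```yaml
# Sketch: operator ConfigMap excerpt annotating operator-created Services for an internal NLB
data:
  custom_service_annotations: "service.beta.kubernetes.io/aws-load-balancer-type:nlb,service.beta.kubernetes.io/aws-load-balancer-internal:true"
```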
and to the postgresql manifests (also sketched below).
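For instance (enableMasterLoadBalancer and allowedSourceRanges are existing manifest fields; the CIDR is a placeholder):

```yaml
# Sketch: postgresql manifest excerpt exposing the master through the operator-created load balancer
spec:
  enableMasterLoadBalancer: true
  allowedSourceRanges:
    - 10.0.0.0/8   # placeholder: restrict to your internal network
```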
We manage all metrics and alerts in our k8s clusters using Prometheus, Alertmanager and Grafana. So the next issue was getting the commonly used postgres-exporter (https://github.com/wrouesnel/postgres_exporter) to work. There are some issues and discussions (#264) related to this topic. In my opinion this is a very common use case, and it can be addressed fairly easily, but it's not documented end to end anywhere. We added a generic sidecar to enable the postgres-exporter. The important part here is how we made the sidecar generic, by reusing the env variables that are added to all containers by the operator (https://github.com/zalando/postgres-operator/blob/master/docs/user.md#sidecar-support), roughly as sketched below.
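The POSTGRES_USER/POSTGRES_PASSWORD variable names come from the sidecar docs linked above; the image, port and connection string are placeholders:

```yaml
# Sketch: generic postgres-exporter sidecar in the postgresql manifest (placeholder details)
spec:
  sidecars:
    - name: exporter
      image: wrouesnel/postgres_exporter:latest
      ports:
        - name: exporter
          containerPort: 9187
          protocol: TCP
      env:
        # assumes the operator-injected POSTGRES_USER/POSTGRES_PASSWORD are available
        # for $(VAR) expansion, which is what makes this sidecar generic across clusters
        - name: DATA_SOURCE_URI
          value: "localhost:5432/postgres?sslmode=disable"
        - name: DATA_SOURCE_USER
          value: "$(POSTGRES_USER)"
        - name: DATA_SOURCE_PASS
          value: "$(POSTGRES_PASSWORD)"
```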
Unfortunately this sidecar has to be added to all postgresql manifests deployed by our developers, but that will be addressed soon (#890). Adding the sidecar is unfortunately not enough: you also have to add a dedicated service that selects all the Postgres cluster pods, so that you can scrape metrics not only from the master or the replica pods but from all pods in the Postgres cluster, along the lines of the sketch below.
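The application/cluster-name selector labels follow the operator's pod labelling; the Service name, cluster name and port are placeholders:

```yaml
# Sketch: Service selecting all pods of one Postgres cluster so Prometheus can scrape every member
apiVersion: v1
kind: Service
metadata:
  name: acid-minimal-cluster-metrics
  labels:
    app: postgres-exporter
spec:
  clusterIP: None          # headless, only used for discovery/scraping
  selector:
    application: spilo                  # label the operator puts on all Spilo pods
    cluster-name: acid-minimal-cluster  # per-cluster label added by the operator
  ports:
    - name: exporter
      port: 9187
      targetPort: 9187
```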
This is unfortunately something that every user deploying postgresql manifests has to add, because it cannot be made generic and there is no way to get the operator to create this service. So we will provide a PR enabling a simple keyword in the postgresql spec that adds every component needed to enable the postgres-exporter. We will maintain a fork if it doesn't get approved and merged, given the statement by @Jan-M in #264 (comment):

> Quick answer is no, there is no intent to make the operator "monitor" anything. Ideally the operator focuses on "operation" and more specifically on the provisioning and modifying part. The "ops" part we largely leave to Patroni which is very well suited for taking care of the cluster itself.

The learning curve using the Zalando Postgres Operator was steep, but in the end, once we understood the components and how they work, I must say the operator just works fantastically. KubeDB had some serious issues with data loss, and Crunchy is just not focused on empowering the regular developer with Postgres capabilities inside Kubernetes; it looks to me more like an "RDS replacement" for database management teams.
No single source of truth. Very sad about the quality of this operator's documentation!

clone:
  uid: 8f74a620-36e6-41fe-862c-816a713c30b8
  cluster: prod-db
  timestamp: 2020-07-11T11:20:08+01:00
  # s3_wal_path: s3://db-backup/spilo/prod-db
  s3_access_key_id: XXXXXXX
  s3_secret_access_key: XXXXXX
@abdennour pasting parts of the Patroni and Postgres logs could help us to help you.
@frittentheke you mentioned cloning can be done from a "dead" S3 backup (pg_basebackups + shipped WAL files) which are created/managed by WAL-E. As far as I could find in the documentation, logical backups are stored to S3. But how do I store pg_basebackups to S3? How can WAL files be shipped to S3? Is this process manual, or is there a parameter we can set in postgres-operator to do this job? And if recovery from a logical backup in S3 is not possible, then what is the point of the logical backup? Could you please elaborate a bit, as I'm relatively new to this. Thanks in advance.
Thank you!
Should be clearer now with the parts I've added to the docs with #1592.
The documentation suggests that we can create a clone from an S3 bucket, but it does not state explicitly in what format the backup should be there:
https://postgres-operator.readthedocs.io/en/latest/user/#how-to-clone-an-existing-postgresql-cluster

One could assume that this is the format used by logical backups, which, as their documentation states, use `pg_dumpall`:
https://postgres-operator.readthedocs.io/en/latest/administrator/#logical-backups

But I think that is not the case, as further in the documentation I can see that this is a basebackup, which would suggest the usage of `pg_basebackup`:
https://postgres-operator.readthedocs.io/en/latest/reference/cluster_manifest/#parameters-defining-how-to-clone-the-cluster-from-another-one

The two facts above would suggest that the logical backups feature is incompatible with the clone feature, and that I have to exec into the cluster and do a `pg_basebackup` to S3 myself by hand. But if that is the case, then I don't understand this part of the documentation:

> The S3 bucket must contain a backup and WALs made by wal-e or wal-g.

This would suggest that there is some mechanism or convention by which the S3 bucket should be populated, but I cannot find any documentation for this. I would be grateful if someone would explain to me how this cloning functionality should work in detail and help to improve the documentation.