kube-apiserver crashes during pgbackrest backups #1539

iohenkies · 2023-07-13T05:54:23Z

Hi all,

I originally posted this at pgbackrest/pgbackrest#2118 but was advised to give it a go here. Hopefully you have some ideas? :)

We've got an 85-node cluster running all sorts of stuff. The control plane and etcd nodes are separate from our workers and from each other. We also have dedicated node pools: for instance one for Elasticsearch, a lot of workers for all kinds of apps, and one for our Postgres databases. The Postgres pool has 11 nodes with 8 vCPUs and 32 GB of memory each.

At 2am and 6am about 60 pgbackrest backups are started. This often, but not always, makes the kube-apiserver containers on our control plane nodes crash. This is very strange to us: why would pgbackrest put such a load on the apiserver? We've tried to replicate the issue by spawning 300 pods of another app at the same time, all calling the apiserver, and the kube-apiserver kept running. It only seems to happen during these backups.

We have audit logging enabled on the kube-apiserver, and up until right before the crashes we don't see anything unusual. Then it gets too busy and crashes, so we probably can't capture the very end of the logs. The only thing that stands out in the pgbackrest logs is quite a lot of "apiserver was unable to write a JSON response: http: Handler timeout" errors. These appear not only around the crashes but also during the day.

Now, we are no database experts, and our DBA colleague who led the Postgres setup is on long-term sick leave, so we're hoping to make use of the expertise here! Maybe there are settings that can be tweaked? Or can someone explain whether pgbackrest makes a lot of calls to the apiserver, and if so, which ones?

pgBackRest version:
pgBackRest 2.40
PostgreSQL version:
postgres (PostgreSQL) 14.5
Operating system/version - if you have more than one server (for example, a database server, a repository host server, one or more standbys), please specify each:
Kubernetes 1.24.10 on Ubuntu 20.04.5 LTS nodes
Did you install pgBackRest from source or from a package?
Installed on Kubernetes 1.24.10, running image registry.developers.crunchydata.com/crunchydata/postgres-operator:ubi8-5.2.0-0
Please attach the following as applicable:
pgbackrest conf

bash-4.4$ cat pgbackrest_instance.conf
# Generated by postgres-operator. DO NOT EDIT.
# Your changes will not be saved.

[global]
buffer-size = 2MiB
compress-type = lz4
log-path = /pgdata/pgbackrest/log
process-max = 2
repo1-path = /pgbackrest/grafana/grafana
repo1-retention-full = 2
repo1-retention-full-type = time
repo1-s3-bucket = npo
repo1-s3-endpoint = storagegrid.s3.ourdomain.com
repo1-s3-port = 443
repo1-s3-region = NL-AER-1
repo1-s3-uri-style = path
repo1-storage-ca-file = /etc/pgbackrest/conf.d/root.pem
repo1-storage-verify-tls = y
repo1-type = s3

[db]
pg1-path = /pgdata/pg14
pg1-port = 5432
pg1-socket-path = /tmp/postgres

Backup command

bash -ceu --
shopt -s globstar
files=(/etc/pgbackrest/conf.d/**)
for i in "${!files[@]}"; do
  [[ -f "${files[$i]}" ]] || unset -v "files[$i]"
done
declare -r hash="$1" local_hash="$(sha1sum "${files[@]}" | sha1sum)"
if [[ "${local_hash}" != "${hash}" ]]; then
  printf >&2 "hash %s does not match local hash %s" "${hash}" "${local_hash}"; exit 1;
else
  pgbackrest backup --stanza=db --repo=1 --type=incr
fi
- 725c12672026deac030f95c75a5abee7186e180a  -
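For readers unfamiliar with this wrapper: as far as can be read from the command itself, it only starts `pgbackrest backup` when a hash of the mounted config files equals the hash passed in as the first argument. The sketch below isolates that check. It is an illustration, not the operator's code: the function names `config_hash` and `verify_config_hash` are made up, it assumes GNU `sha1sum`, and it lists files with `find` and `sort` instead of `globstar`, so the resulting value is not guaranteed to equal the one the operator computes.

```shell
# Hypothetical helper: hash all regular files under a directory.
# sha1sum prints "<hash>  <path>" per file; hashing that listing again
# yields a single line of the form "<hash>  -".
# Assumption: file names contain no whitespace (xargs would split them).
config_hash() {
  find "$1" -type f | LC_ALL=C sort | xargs sha1sum | sha1sum
}

# Hypothetical helper: succeed only if the directory still hashes to the
# expected value, mirroring the if/else in the wrapper above.
verify_config_hash() {
  expected="$1"
  dir="$2"
  actual="$(config_hash "$dir")"
  if [ "$actual" != "$expected" ]; then
    printf >&2 'hash %s does not match local hash %s\n' "$expected" "$actual"
    return 1
  fi
}
```

A call such as `verify_config_hash "$expected" /etc/pgbackrest/conf.d && pgbackrest backup --stanza=db --repo=1 --type=incr` would then mimic the wrapper's behaviour, with `$expected` being a value previously produced by `config_hash`.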

Errors in log

apiserver was unable to write a JSON response: http: Handler timeout
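To check whether these errors cluster around the 2am and 6am backup windows, the matching lines could be counted per hour. This is a rough sketch under one loud assumption: every log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp. We have not confirmed the exact log format, and the function name `count_handler_timeouts_by_hour` is made up for illustration.

```shell
# Hypothetical helper: count "Handler timeout" lines per hour in a log file.
# Assumption: each line begins with "YYYY-MM-DD HH:MM:SS", so the first
# 13 characters ("YYYY-MM-DD HH") identify the hour.
count_handler_timeouts_by_hour() {
  grep -F 'http: Handler timeout' "$1" \
    | cut -c1-13 \
    | sort \
    | uniq -c \
    | awk '{ print $2, $3, $1 }'   # reorder to: date hour count
}
```

Calling it with the path of a saved log file prints one line per hour in the form `date hour count`, which makes it easy to see whether the counts spike around the backup times or are spread evenly over the day.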
