Loop to create pgbackrest process increases S3 API usage #868

Open
bzp2010 opened this issue Aug 20, 2024 · 7 comments
Labels
bug Something isn't working

Comments

bzp2010 commented Aug 20, 2024

Report

Upgrading the operator from 2.3.1 to 2.4.1 resulted in a significant increase in S3 API call billing (especially for the list objects API). The cause is a pgbackrest process that something keeps creating in the background. I could not find any existing reports of this.

More about the problem

A few weeks ago I updated my operator deployment from 2.3.1 to 2.4.1, and since then I've noticed many more calls to my S3 bucket API, which may have increased the monetary cost of those calls.

So I did some investigating: I checked the logs on the cluster CNI and found that some processes in the primary database instance were constantly accessing the S3 API.

When I went into the database instance's container, I found pgbackrest processes being created in a loop, all running the command pgbackrest info --output=json --repo=1.

I suspected they were the source of the constant S3 API calls, but at the time I didn't know what was starting these processes, so I went through a process of elimination, starting by shutting down the operator Pod.

After that, the strange pgbackrest processes stopped appearing and nothing called the S3 API again, which confirmed my suspicion.


Investigating the code further, I found that this may be related to the following code, which apparently keeps issuing pgbackrest queries in the primary database pod via the Kubernetes Exec API.

pgbackrestInfo, err := pgbackrest.GetInfo(ctx, primary, backup.Spec.RepoName)

It has to do with the operator's constant re-reconciliation of PGBackup, which, upon checking, turns out to have been introduced in 2.4.0 (#759). That fits my guess as well.
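
For context, here is a minimal sketch of how a controller can run a command like pgbackrest info --output=json inside the primary pod through the Kubernetes Exec API, using client-go's remotecommand package. This is an illustrative reconstruction, not the operator's actual code; the execInPrimary helper and the "database" container name are assumptions.

```go
package pgexec

import (
	"bytes"
	"context"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// execInPrimary runs a command in the "database" container of the given pod
// and returns its stdout. Hypothetical helper for illustration only.
func execInPrimary(ctx context.Context, cfg *rest.Config, cs kubernetes.Interface, pod *corev1.Pod, command ...string) (string, error) {
	req := cs.CoreV1().RESTClient().Post().
		Resource("pods").
		Namespace(pod.Namespace).
		Name(pod.Name).
		SubResource("exec").
		VersionedParams(&corev1.PodExecOptions{
			Container: "database",
			Command:   command,
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	// Build an SPDY executor and stream the command output back.
	exec, err := remotecommand.NewSPDYExecutor(cfg, "POST", req.URL())
	if err != nil {
		return "", err
	}

	var stdout, stderr bytes.Buffer
	err = exec.StreamWithContext(ctx, remotecommand.StreamOptions{
		Stdout: &stdout,
		Stderr: &stderr,
	})
	return stdout.String(), err
}
```

With an S3-backed repo, each such execution makes pgbackrest issue S3 requests, so the exec frequency maps directly onto billed API calls.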


The following keeps appearing in the log:

Logs

2024-08-20T13:16:38.840Z	INFO	Starting workers	{"controller": "perconapgrestore", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGRestore", "worker count": 1}
2024-08-20T13:16:38.840Z	INFO	Starting workers	{"controller": "pgupgrade", "controllerGroup": "postgres-operator.crunchydata.com", "controllerKind": "PGUpgrade", "worker count": 1}
2024-08-20T13:16:38.939Z	INFO	Starting workers	{"controller": "postgrescluster", "controllerGroup": "postgres-operator.crunchydata.com", "controllerKind": "PostgresCluster", "worker count": 2}
2024-08-20T13:16:38.965Z	INFO	Starting workers	{"controller": "pgadmin", "controllerGroup": "postgres-operator.crunchydata.com", "controllerKind": "PGAdmin", "worker count": 1}
2024-08-20T13:16:44.312Z	INFO	Superusers are exposed through PGBouncer	{"controller": "postgrescluster", "controllerGroup": "postgres-operator.crunchydata.com", "controllerKind": "PostgresCluster", "PostgresCluster": {"name":"postgres","namespace":"database"}, "namespace": "database", "name": "postgres", "reconcileID": "37cb444a-893f-4b5d-a95f-d314eafecb12"}
+ 2024-08-20T13:16:52.295Z	INFO	Got latest restorable timestamp	{"controller": "perconapgbackup", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGBackup", "PerconaPGBackup": {"name":"postgres-repo1-diff-d89bt","namespace":"database"}, "namespace": "database", "name": "postgres-repo1-diff-d89bt", "reconcileID": "5dee8eb6-657b-4be9-a6ee-b9c39a232ec0", "request": {"name":"postgres-repo1-diff-d89bt","namespace":"database"}, "timestamp": "2024-08-20 13:16:20.639199 +0000 UTC"}
2024-08-20T13:16:58.033Z	INFO	Superusers are exposed through PGBouncer	{"controller": "postgrescluster", "controllerGroup": "postgres-operator.crunchydata.com", "controllerKind": "PostgresCluster", "PostgresCluster": {"name":"postgres","namespace":"database"}, "namespace": "database", "name": "postgres", "reconcileID": "9bc26f03-85a3-466d-822a-446ba8f42987"}
+ 2024-08-20T13:17:02.822Z	INFO	Latest commit timestamp	{"controller": "perconapgcluster", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGCluster", "PerconaPGCluster": {"name":"postgres","namespace":"database"}, "namespace": "database", "name": "postgres", "reconcileID": "2f1a14dd-6132-4de7-a9a5-66ad58d68c1f", "timestamp": "2024-08-20 13:16:20.639199 +0000 UTC", "latestRestorableTime": "2024-08-20 13:03:15.036014 +0000 UTC"}
+ 2024-08-20T13:17:02.822Z	INFO	Triggering PGBackup reconcile	{"controller": "perconapgcluster", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGCluster", "PerconaPGCluster": {"name":"postgres","namespace":"database"}, "namespace": "database", "name": "postgres", "reconcileID": "2f1a14dd-6132-4de7-a9a5-66ad58d68c1f", "latestBackup": "postgres-repo1-incr-m24n6", "latestRestorableTime": "2024-08-20 13:03:15.036014 +0000 UTC", "latestCommitTimestamp": "2024-08-20 13:16:20.639199 +0000 UTC"}
+ 2024-08-20T13:17:05.848Z	INFO	Got latest restorable timestamp	{"controller": "perconapgbackup", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGBackup", "PerconaPGBackup": {"name":"postgres-repo1-diff-fjg5l","namespace":"database"}, "namespace": "database", "name": "postgres-repo1-diff-fjg5l", "reconcileID": "e2e1fef5-0034-411c-a356-abef3df2541b", "request": {"name":"postgres-repo1-diff-fjg5l","namespace":"database"}, "timestamp": "2024-08-20 13:16:20.639199 +0000 UTC"}
2024-08-20T13:17:11.291Z	INFO	Superusers are exposed through PGBouncer	{"controller": "postgrescluster", "controllerGroup": "postgres-operator.crunchydata.com", "controllerKind": "PostgresCluster", "PostgresCluster": {"name":"postgres","namespace":"database"}, "namespace": "database", "name": "postgres", "reconcileID": "d3ef925f-41d4-476f-b11e-84560de7313b"}
+ 2024-08-20T13:17:15.067Z	INFO	Latest commit timestamp	{"controller": "perconapgcluster", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGCluster", "PerconaPGCluster": {"name":"postgres","namespace":"database"}, "namespace": "database", "name": "postgres", "reconcileID": "2f1a14dd-6132-4de7-a9a5-66ad58d68c1f", "timestamp": "2024-08-20 13:16:20.639199 +0000 UTC", "latestRestorableTime": "2024-08-20 13:03:15.036014 +0000 UTC"}
+ 2024-08-20T13:17:15.067Z	INFO	Triggering PGBackup reconcile	{"controller": "perconapgcluster", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGCluster", "PerconaPGCluster": {"name":"postgres","namespace":"database"}, "namespace": "database", "name": "postgres", "reconcileID": "2f1a14dd-6132-4de7-a9a5-66ad58d68c1f", "latestBackup": "postgres-repo1-incr-m24n6", "latestRestorableTime": "2024-08-20 13:03:15.036014 +0000 UTC", "latestCommitTimestamp": "2024-08-20 13:16:20.639199 +0000 UTC"}
+ 2024-08-20T13:17:18.596Z	INFO	Got latest restorable timestamp	{"controller": "perconapgbackup", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGBackup", "PerconaPGBackup": {"name":"postgres-repo1-full-647nr","namespace":"database"}, "namespace": "database", "name": "postgres-repo1-full-647nr", "reconcileID": "708d3e2c-28c8-42af-8680-c615cfea55e0", "request": {"name":"postgres-repo1-full-647nr","namespace":"database"}, "timestamp": "2024-08-20 13:16:20.639199 +0000 UTC"}
2024-08-20T13:17:24.754Z	INFO	Superusers are exposed through PGBouncer	{"controller": "postgrescluster", "controllerGroup": "postgres-operator.crunchydata.com", "controllerKind": "PostgresCluster", "PostgresCluster": {"name":"postgres","namespace":"database"}, "namespace": "database", "name": "postgres", "reconcileID": "a3062a24-6b6b-48eb-baf9-ed5764499a93"}
+ 2024-08-20T13:17:27.231Z	INFO	Latest commit timestamp	{"controller": "perconapgcluster", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGCluster", "PerconaPGCluster": {"name":"postgres","namespace":"database"}, "namespace": "database", "name": "postgres", "reconcileID": "2f1a14dd-6132-4de7-a9a5-66ad58d68c1f", "timestamp": "2024-08-20 13:16:20.639199 +0000 UTC", "latestRestorableTime": "2024-08-20 13:03:15.036014 +0000 UTC"}
+ 2024-08-20T13:17:27.231Z	INFO	Triggering PGBackup reconcile	{"controller": "perconapgcluster", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGCluster", "PerconaPGCluster": {"name":"postgres","namespace":"database"}, "namespace": "database", "name": "postgres", "reconcileID": "2f1a14dd-6132-4de7-a9a5-66ad58d68c1f", "latestBackup": "postgres-repo1-incr-m24n6", "latestRestorableTime": "2024-08-20 13:03:15.036014 +0000 UTC", "latestCommitTimestamp": "2024-08-20 13:16:20.639199 +0000 UTC"}
+ 2024-08-20T13:17:31.686Z	INFO	Got latest restorable timestamp	{"controller": "perconapgbackup", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGBackup", "PerconaPGBackup": {"name":"postgres-repo1-incr-lv6dp","namespace":"database"}, "namespace": "database", "name": "postgres-repo1-incr-lv6dp", "reconcileID": "2e9daa86-2df8-4758-ac43-82b5f70c4d0c", "request": {"name":"postgres-repo1-incr-lv6dp","namespace":"database"}, "timestamp": "2024-08-20 13:16:20.639199 +0000 UTC"}
2024-08-20T13:17:38.115Z	INFO	Superusers are exposed through PGBouncer	{"controller": "postgrescluster", "controllerGroup": "postgres-operator.crunchydata.com", "controllerKind": "PostgresCluster", "PostgresCluster": {"name":"postgres","namespace":"database"}, "namespace": "database", "name": "postgres", "reconcileID": "306f8c5b-099b-42aa-9764-cae1c3fd0395"}
+ 2024-08-20T13:17:39.776Z	INFO	Latest commit timestamp	{"controller": "perconapgcluster", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGCluster", "PerconaPGCluster": {"name":"postgres","namespace":"database"}, "namespace": "database", "name": "postgres", "reconcileID": "2f1a14dd-6132-4de7-a9a5-66ad58d68c1f", "timestamp": "2024-08-20 13:17:20.615232 +0000 UTC", "latestRestorableTime": "2024-08-20 13:03:15.036014 +0000 UTC"}
+ 2024-08-20T13:17:39.776Z	INFO	Triggering PGBackup reconcile	{"controller": "perconapgcluster", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGCluster", "PerconaPGCluster": {"name":"postgres","namespace":"database"}, "namespace": "database", "name": "postgres", "reconcileID": "2f1a14dd-6132-4de7-a9a5-66ad58d68c1f", "latestBackup": "postgres-repo1-incr-m24n6", "latestRestorableTime": "2024-08-20 13:03:15.036014 +0000 UTC", "latestCommitTimestamp": "2024-08-20 13:17:20.615232 +0000 UTC"}
+ 2024-08-20T13:17:45.036Z	INFO	Got latest restorable timestamp	{"controller": "perconapgbackup", "controllerGroup": "pgv2.percona.com", "controllerKind": "PerconaPGBackup", "PerconaPGBackup": {"name":"postgres-repo1-incr-m24n6","namespace":"database"}, "namespace": "database", "name": "postgres-repo1-incr-m24n6", "reconcileID": "881ef426-eabd-4c37-928c-aebb3bd8221c", "request": {"name":"postgres-repo1-incr-m24n6","namespace":"database"}, "timestamp": "2024-08-20 13:17:20.615232 +0000 UTC"}

They occur very regularly and cyclically, and they are also related to the WAL watcher introduced in #759.

I'm wondering whether this is actually expected behavior (constantly executing pgbackrest info to query backup info from the S3 bucket)?

If this is the design intent, how can I turn it off, or at least reduce its frequency, to keep costs down?

Steps to reproduce

The issue appears to be built into behavior that is triggered whenever backups are enabled on operator 2.4.0+.
Without any special trigger, simply going into the database pod and checking with top reveals that the pgbackrest PID keeps incrementing, indicating that the program is being launched in a continuous loop.

So this may not require step-by-step reproduction. 🤔

Versions

  1. Kubernetes: 1.30.0
  2. Operator: 2.4.1
  3. Database: PostgreSQL 16

Anything else?

When the operator is shut down, most of the S3 API calls disappear. However, pgbackrest is still started occasionally, though with a much longer period.

bzp2010 added the bug label Aug 20, 2024
@seacom-ms

The same thing happened to me, but with Azure backups configured. Converting back to S3 backups resolved it for me.

Author

bzp2010 commented Aug 23, 2024

I am using Backblaze through its S3-compatible API, but I think this has nothing to do with the specific S3 provider; the issue should happen with every S3 provider.

Contributor

egegunes commented Aug 28, 2024

Hey folks, I created https://perconadev.atlassian.net/browse/K8SPG-630 to let you disable this feature so you can control the API usage. It'll be included in the upcoming v2.5.0 release.

Collaborator

hors commented Aug 28, 2024

@bzp2010 @seacom-ms We run "pgbackrest info --output=json .." to be able to add "latestRestorableTime" to the backup object:

apiVersion: pgv2.percona.com/v2
kind: PerconaPGBackup
...
spec:
  options:
  - --type=full
  pgCluster: cluster1
  repoName: repo1
status:
  backupName: 20240828-141255F
  backupType: full
  ....
  jobName: cluster1-backup-2wct
  latestRestorableTime: 2024-08-28 14:15:32.570839+0000

but we understand that we need to make it possible to disable this.
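
For illustration, pgbackrest info --output=json prints a JSON array with one object per stanza, which is what makes it convenient to consume programmatically. A minimal decoding sketch follows; the parseStanzas helper is hypothetical, and the exact fields the operator uses to derive latestRestorableTime are not reproduced here.

```go
package pgexec

import (
	"encoding/json"
	"fmt"
)

// parseStanzas decodes the output of `pgbackrest info --output=json`, which
// is a JSON array with one object per stanza. Hypothetical helper; the real
// operator code and the fields it reads may differ.
func parseStanzas(raw []byte) ([]map[string]any, error) {
	var stanzas []map[string]any
	if err := json.Unmarshal(raw, &stanzas); err != nil {
		return nil, fmt.Errorf("parse pgbackrest info output: %w", err)
	}
	return stanzas, nil
}
```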

Author

bzp2010 commented Aug 28, 2024

Thanks a lot! @egegunes @hors
I'm not very well versed in these components, but as a subjective guess, could this step be executed once at the end of each backup, rather than every few seconds? 🤔

@egegunes
Contributor

@bzp2010 The idea is to have a background worker that watches commit timestamps and updates the latest successful backup's status, so users can see the latest point in time they can restore to using that backup. If we only did this after a new backup finishes, I don't think we'd provide much value.

We can also add a field that lets users control the watcher's period, so you can have less frequent checks and therefore fewer API calls.
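
A rough sketch of the kind of period-controlled watcher described above, assuming a hypothetical CheckInterval knob and an updateLatestRestorableTime callback (neither is the operator's actual API):

```go
package pgexec

import (
	"context"
	"log"
	"time"
)

// WatcherConfig holds a hypothetical knob for the check period; a longer
// interval means fewer `pgbackrest info` executions and fewer S3 API calls.
type WatcherConfig struct {
	CheckInterval time.Duration
}

// runWALWatcher periodically refreshes the latest restorable time of the most
// recent successful backup until the context is cancelled.
func runWALWatcher(ctx context.Context, cfg WatcherConfig, updateLatestRestorableTime func(context.Context) error) {
	ticker := time.NewTicker(cfg.CheckInterval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := updateLatestRestorableTime(ctx); err != nil {
				log.Printf("update latest restorable time: %v", err)
			}
		}
	}
}
```

Raising CheckInterval would directly reduce how often pgbackrest info runs and therefore how many S3 API calls are made.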

Collaborator

hors commented Aug 28, 2024

> @bzp2010 The idea is to have a background worker that watches commit timestamps and updates the latest successful backup's status, so users can see the latest point in time they can restore to using that backup. If we only did this after a new backup finishes, I don't think we'd provide much value.
>
> We can also add a field that lets users control the watcher's period, so you can have less frequent checks and therefore fewer API calls.

@bzp2010 It is for PITR, and pgBackRest uploads WALs into the repo continuously, so we need to update it more than once.
