[bitnami/postgresql-repmgr] 3-node cluster fails to start properly after all nodes are restarted simultaneously #67372

Closed
kwenzh opened this issue May 31, 2024 · 9 comments · Fixed by #67370
Labels: postgresql-repmgr, solved, stale, tech-issues, triage

Comments

@kwenzh
Contributor

kwenzh commented May 31, 2024

Name and Version

bitnami/postgresql-repmgr:15.5.0-debian-11-r15

What architecture are you using?

amd64

What steps will reproduce the bug?

  1. Deploy a 3-node PostgreSQL cluster on Kubernetes v1.23, in static pod mode.
  2. Verify that it works correctly: there is one primary and two standbys.
  3. Stop PostgreSQL on every node at almost the same time.
  4. Verify that PostgreSQL has stopped on every node.
  5. Start PostgreSQL on every node at almost the same time.
  6. Check the status of each PostgreSQL instance.
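
Steps 3 and 5 could be scripted per node roughly as follows. This is only a sketch under assumptions that the report does not confirm: that the pods are stopped by moving their manifest out of the kubelet's static pod directory, that this directory is `/etc/kubernetes/manifests`, and that the manifest is called `postgresql-repmgr.yaml`. The directory names, the file name, and both function names are hypothetical.

```shell
#!/bin/sh
# Assumed locations; adjust to the kubelet's staticPodPath on the node.
MANIFEST_DIR="${MANIFEST_DIR:-/etc/kubernetes/manifests}"
PARKING_DIR="${PARKING_DIR:-/etc/kubernetes/manifests-parked}"
# Hypothetical manifest file name for the PostgreSQL static pod.
MANIFEST_NAME="${MANIFEST_NAME:-postgresql-repmgr.yaml}"

# Stop the static pod: the kubelet removes a static pod once its manifest
# disappears from the watched directory.
stop_static_pod() {
    mkdir -p "$PARKING_DIR"
    mv "$MANIFEST_DIR/$MANIFEST_NAME" "$PARKING_DIR/$MANIFEST_NAME"
}

# Start the static pod again by putting the manifest back.
start_static_pod() {
    mv "$PARKING_DIR/$MANIFEST_NAME" "$MANIFEST_DIR/$MANIFEST_NAME"
}
```

Running `stop_static_pod` on all three nodes within a few seconds, and later `start_static_pod` in the same way, would approximate the "almost at the same time" condition.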

What is the expected behavior?

The PostgreSQL cluster recovers to a healthy state.

What do you see instead?

Every PostgreSQL pod crashes, and the cluster cannot elect a primary.

PostgreSQL pod exit log:

[2024-05-31 18:54:06] [ERROR] unable connect to upstream node (ID: 2337678), terminating
[2024-05-31 18:54:06] [HINT] upstream node must be running before repmgrd can start
[2024-05-31 18:54:06] [INFO] repmgrd terminating...

Additional information

wal_log_hints = 'on'

reconnect_attempts='24'
reconnect_interval='5'

node_rejoin_timeout=300
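
For orientation, the retry settings above can be turned into a time window with simple arithmetic. This sketch only multiplies the two values from the report; whether this reconnect window applies at all to the startup path that produced the log above is not established here.

```shell
#!/bin/sh
# Values copied from the configuration in this report.
reconnect_attempts=24
reconnect_interval=5      # seconds between attempts
node_rejoin_timeout=300   # seconds

# Total time spent retrying before giving up on the upstream node.
reconnect_window=$(( reconnect_attempts * reconnect_interval ))

echo "reconnect window: ${reconnect_window}s"
echo "node_rejoin_timeout: ${node_rejoin_timeout}s"
```

With these values the reconnect window is 120 seconds, which is shorter than the 300-second `node_rejoin_timeout`.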
@kwenzh kwenzh added the tech-issues The user has a technical issue about an application label May 31, 2024
@github-actions github-actions bot added the triage Triage is needed label May 31, 2024
@kwenzh
Contributor Author

kwenzh commented May 31, 2024

For example, suppose there are three nodes, A, B, and C, and C is the primary before they are all shut down. When they are all started at the same time, C is not yet running properly. B starts and cannot find the primary, so repmgr starts the postgres process on B directly. C starts shortly afterwards, connects to B's service on port 5432, and learns that the primary is C. That address is not filtered out even though it is C itself, so C tries to connect to itself and fails. The result is a circular dependency between B and C.
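
The race can be illustrated with a toy model. It involves no real PostgreSQL or repmgr: the variables `running_nodes` and `recorded_primary` and the functions `is_running` and `try_start` are hypothetical names that exist only for this illustration.

```shell
#!/bin/sh
# Toy state: only B's postgres process is up, and B still serves stale
# metadata that names C as the primary.
running_nodes="B"
recorded_primary="C"

# is_running NODE: succeed if NODE is in the list of running nodes.
is_running() {
    case " $running_nodes " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# try_start NODE: look up the recorded primary via a running peer, then try
# to reach that primary. No check is made for the primary being NODE itself.
try_start() {
    local node="$1"
    local primary="$recorded_primary"
    if is_running "$primary"; then
        echo "$node: connected to primary $primary"
        return 0
    fi
    echo "$node: primary $primary is not reachable, terminating"
    return 1
}
```

Calling `try_start C` fails because C is told that the primary is C, which is not running yet; `try_start B` fails for the same reason, so neither node makes progress.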

@kwenzh
Contributor Author

kwenzh commented May 31, 2024

The script only checks whether the primary node is the node itself at https://github.com/bitnami/containers/blob/main/bitnami/postgresql-repmgr/15/debian-12/rootfs/opt/bitnami/scripts/librepmgr.sh#L224

but there is no such self-check at https://github.com/bitnami/containers/blob/main/bitnami/postgresql-repmgr/15/debian-12/rootfs/opt/bitnami/scripts/librepmgr.sh#L240

When repmgr learns from another node's PostgreSQL service that the primary is the node itself, it retries connecting to its own PostgreSQL service, which is not yet up and ready.
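
A minimal sketch of the kind of self-check described above. It is not the code from `librepmgr.sh` or from any pull request: the function name `is_reported_primary_usable` and the host names in the usage note are hypothetical, and comparing host and port as plain strings is an assumption (a real check might need to resolve names or compare IP addresses).

```shell
#!/bin/sh
# is_reported_primary_usable REPORTED_HOST REPORTED_PORT OWN_HOST OWN_PORT
# Returns 0 if the primary reported by a peer is a different node,
# and 1 if the peer's (possibly stale) metadata points back at this node.
is_reported_primary_usable() {
    local reported_host="$1"
    local reported_port="$2"
    local own_host="$3"
    local own_port="$4"

    if [ "$reported_host" = "$own_host" ] && [ "$reported_port" = "$own_port" ]; then
        # Connecting to ourselves would fail: our PostgreSQL is not up yet.
        return 1
    fi
    return 0
}
```

For example, on a node advertised as `pg-c:5432`, a peer reporting `pg-c:5432` as the primary would be rejected, while a peer reporting `pg-b:5432` would be accepted.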

@kwenzh
Contributor Author

kwenzh commented May 31, 2024

This looks similar to #999.

@carrodher
Member

Thank you for bringing this issue to our attention. We appreciate your involvement! If you're interested in contributing a solution, we welcome you to create a pull request. The Bitnami team is excited to review your submission and offer feedback. You can find the contributing guidelines here.

Your contribution will greatly benefit the community. Feel free to reach out if you have any questions or need assistance.

@kwenzh
Contributor Author

kwenzh commented Jun 3, 2024

Please take a look at #67370.

@carrodher
Member

Thank you for opening this issue and submitting the associated Pull Request. Our team will review and provide feedback. Once the PR is merged, the issue will automatically close.

Your contribution is greatly appreciated!

@kwenzh
Contributor Author

kwenzh commented Jun 3, 2024

Feel free to reach out if you have any questions or need assistance.


This Issue has been automatically marked as "stale" because it has not had recent activity (for 15 days). It will be closed if no further activity occurs. Thanks for the feedback.

@github-actions github-actions bot added the stale 15 days without activity label Jun 19, 2024

Due to the lack of activity in the last 5 days since it was marked as "stale", we proceed to close this Issue. Do not hesitate to reopen it later if necessary.

@bitnami-bot bitnami-bot closed this as not planned Jun 24, 2024