[DPE-5311] Enable DCS failsafe mode #677

Open
wants to merge 1 commit into
base: main
dragomirp
Contributor

Issue

Patroni depends on the K8s API and its resources to maintain the leader lock, so the primary will step down during a temporary K8s API outage.

Solution

Enable DCS failsafe mode. If the K8s API (the DCS) is not available, Patroni will try to keep the current primary, provided it can connect to all cluster members.

Documentation: https://patroni.readthedocs.io/en/latest/dcs_failsafe_mode.html#dcs-failsafe-mode
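
For readers unfamiliar with the setting, here is a minimal sketch of how `failsafe_mode` could be switched on through Patroni's dynamic configuration endpoint (`PATCH /config`). It is not the change made in this PR. The helper name `build_failsafe_patch`, the base URL, and the default port 8008 are assumptions for illustration; a real deployment will usually also need TLS and authentication.

```python
import json
import urllib.request


def build_failsafe_patch(base_url: str = "http://localhost:8008") -> urllib.request.Request:
    """Build (but do not send) a PATCH request enabling failsafe_mode.

    Hypothetical helper: base_url is an assumed Patroni REST API address.
    """
    # failsafe_mode is a dynamic (cluster-wide) Patroni setting.
    payload = json.dumps({"failsafe_mode": True}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/config",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


if __name__ == "__main__":
    req = build_failsafe_patch()
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Passing the returned request to `urllib.request.urlopen` would apply the change to a reachable Patroni instance. Building the request separately from sending it keeps the sketch safe to run anywhere.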

Related to #669 and #616.

codecov bot commented Sep 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.71%. Comparing base (fde548d) to head (0f7c923).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #677   +/-   ##
=======================================
  Coverage   70.71%   70.71%           
=======================================
  Files          11       11           
  Lines        2950     2950           
  Branches      513      513           
=======================================
  Hits         2086     2086           
  Misses        754      754           
  Partials      110      110           


@taurus-forever
Contributor

taurus-forever commented Sep 17, 2024

@7annaba3l we are waiting for your ACK to merge this.

I believe we should enable failsafe mode. @delgod had doubts about split brain, but the manual says it is safe:
https://patroni.readthedocs.io/en/latest/dcs_failsafe_mode.html#dcs-failsafe-mode

If the failsafe mode is enabled and the leader lock update in DCS failed due to
reasons different from the version/value/index mismatch,
Postgres may continue to run as a primary if it can access all
known members of the cluster via Patroni REST API.
...
The primary will execute the failsafe code and contact all known replicas.
These replicas will use this information as an indicator that the primary is
alive and will not start the leader race even if the leader lock in DCS has expired.
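
The rule quoted above can be sketched as a small decision function. This is a simplified model of the documented behaviour only, not Patroni's implementation; the function name `may_remain_primary` and all of its parameters are hypothetical.

```python
from typing import Mapping


def may_remain_primary(
    failsafe_enabled: bool,
    lock_update_ok: bool,
    lock_mismatch: bool,
    member_acks: Mapping[str, bool],
) -> bool:
    """Decide whether the current primary keeps running as primary.

    member_acks maps each known cluster member (hypothetical names) to
    whether it answered the primary's failsafe call over the REST API.
    """
    # Normal case: the leader lock in the DCS was updated.
    if lock_update_ok:
        return True
    # A version/value/index mismatch is excluded from failsafe handling,
    # and without failsafe mode any failed update means stepping down.
    if lock_mismatch or not failsafe_enabled:
        return False
    # DCS unreachable: stay primary only if every known member responded.
    return all(member_acks.values())


if __name__ == "__main__":
    acks = {"replica-1": True, "replica-2": True}
    print(may_remain_primary(True, False, False, acks))
    acks["replica-2"] = False
    print(may_remain_primary(True, False, False, acks))
```

Requiring a response from every known member, rather than a majority, is what the quoted text describes: a primary cut off from any replica cannot rule out that the replica will start a leader race, so it has to step down.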

For @dragomirp: I want to merge this and fix #669 and #616.

Member

@delgod delgod left a comment

After a careful check of the implementation details, everything looks safe (even in the upgrade case), and I believe we should enable this option.

Contributor

@taurus-forever taurus-forever left a comment

Mykola approved, Mohamed notified in MM. LGTM from me. Thank you!

3 participants