[DPE-3887] Avoid replication slot deletion #680
Conversation
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

@@ Coverage Diff @@
##             main     #680      +/-   ##
==========================================
- Coverage   70.75%   70.72%   -0.03%
==========================================
  Files          11       11
  Lines        2968     2972       +4
  Branches      517      518       +1
==========================================
+ Hits         2100     2102       +2
- Misses        757      758       +1
- Partials      111      112       +1

☔ View full report in Codecov by Sentry.
Wow! This is an excellent finding! Thank you!
Issue
After restarting the pods of a PostgreSQL K8s charm deployment (we can use
kubectl -n dev delete pod -l app.kubernetes.io/name=postgresql-k8s
for testing), if the stop hook fails for some reason, the upgrade-charm hook won't fire when the pod is rescheduled, and no Patroni labels will be added to the pod. This makes Patroni treat the unit as no longer part of the cluster, so it deletes that unit's replication slot from the primary.

In the past (before #448, which added a check for container.can_connect()), this could happen due to a re-emitted deferred pebble-ready event.
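For context, a minimal sketch of the kind of guard #448 introduced (an assumed shape, not the exact code from that PR):

```python
def _on_postgresql_pebble_ready(self, event) -> None:
    # Assumed handler name. A deferred pebble-ready event can be re-emitted
    # before the workload container is reachable again, so bail out early
    # instead of acting on a stale container.
    container = event.workload
    if not container.can_connect():
        event.defer()
        return
    # ... normal pebble-ready handling continues here ...
```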
Solution

Patch the pods in the postgresql-pebble-ready hook (only when the unit is already a member of the Patroni/PostgreSQL cluster), which is fired after the charm starts again; a rough sketch of the idea is shown below. An integration test (forcing a failure in the stop hook) was added to avoid regressions.

Some safeguards still need to be added to the stop hook, as a different situation may cause it to fail.
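A minimal sketch of the pod-patching idea, assuming lightkube is used to talk to the Kubernetes API; the handler name, the _patroni.cluster_members helper, and the label keys/values are hypothetical stand-ins for whatever the charm actually uses:

```python
from lightkube import Client
from lightkube.resources.core_v1 import Pod

def _on_postgresql_pebble_ready(self, event) -> None:
    # Only re-apply labels when the unit is already a member of the
    # Patroni/PostgreSQL cluster; a brand-new unit gets labelled through
    # the normal join path.
    pod_name = self.unit.name.replace("/", "-")
    if pod_name not in self._patroni.cluster_members:  # hypothetical helper
        return
    # Re-adding the Patroni labels stops Patroni from treating the unit as
    # removed and dropping its replication slot on the primary.
    Client().patch(
        Pod,
        name=pod_name,
        namespace=self.model.name,
        # Hypothetical label key/value; the real ones come from the
        # Patroni configuration.
        obj={"metadata": {"labels": {"cluster-name": f"patroni-{self.app.name}"}}},
    )
```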
Fixes #433.