
[DPE-3887] Avoid replication slot deletion #680

Merged
marceloneppel merged 8 commits into main on Sep 17, 2024

Conversation

marceloneppel (Member) commented Sep 3, 2024

Issue

After restarting the pods of a PostgreSQL K8s charm deployment (we can use kubectl -n dev delete pod -l app.kubernetes.io/name=postgresql-k8s for testing), if the stop hook fails for some reason, the upgrade-charm hook won't fire when the pod is rescheduled, and no Patroni labels will be added to the pod. Patroni then treats the unit as no longer part of the cluster and deletes that unit's replication slot from the primary.
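
A minimal way to confirm the symptom is to list the replication slots on the primary before and after deleting the pods; the affected unit's slot disappears. The sketch below is illustrative only and not part of the charm; the connection parameters (user, password, database) are placeholders for whatever your deployment uses.

# Illustrative check, not part of the charm: list the replication slots on the
# primary; after the failure scenario, the restarted unit's slot is gone.
# The connection parameters here are placeholders.
import psycopg2


def list_replication_slots(host: str, password: str) -> list:
    """Return (slot_name, active) for every replication slot on the primary."""
    with psycopg2.connect(
        host=host, user="operator", password=password, dbname="postgres"
    ) as connection, connection.cursor() as cursor:
        cursor.execute("SELECT slot_name, active FROM pg_replication_slots;")
        return cursor.fetchall()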

In the past (before #448, which added a check for container.can_connect()), this could happen due to a re-emitted deferred pebble-ready event; a sketch of that guard follows the log below:

unit-postgresql-k8s-2: 17:36:20 ERROR unit.postgresql-k8s/2.juju-log Uncaught exception while in charm code:
Traceback (most recent call last):
  File "/var/lib/juju/agents/unit-postgresql-k8s-2/charm/./src/charm.py", line 1574, in <module>
    main(PostgresqlOperatorCharm, use_juju_for_storage=True)
  File "/var/lib/juju/agents/unit-postgresql-k8s-2/charm/venv/ops/main.py", line 434, in main
    framework.reemit()
  File "/var/lib/juju/agents/unit-postgresql-k8s-2/charm/venv/ops/framework.py", line 863, in reemit
    self._reemit()
  File "/var/lib/juju/agents/unit-postgresql-k8s-2/charm/venv/ops/framework.py", line 942, in _reemit
    custom_handler(event)
  File "/var/lib/juju/agents/unit-postgresql-k8s-2/charm/./src/charm.py", line 660, in _on_postgresql_pebble_ready
    self._create_pgdata(container)
  File "/var/lib/juju/agents/unit-postgresql-k8s-2/charm/./src/charm.py", line 646, in _create_pgdata
    if not container.exists(path):
  File "/var/lib/juju/agents/unit-postgresql-k8s-2/charm/venv/ops/model.py", line 2547, in exists
    self._pebble.list_files(str(path), itself=True)
  File "/var/lib/juju/agents/unit-postgresql-k8s-2/charm/venv/ops/pebble.py", line 2219, in list_files
    resp = self._request('GET', '/v1/files', query)
  File "/var/lib/juju/agents/unit-postgresql-k8s-2/charm/venv/ops/pebble.py", line 1655, in _request
    response = self._request_raw(method, path, query, headers, data)
  File "/var/lib/juju/agents/unit-postgresql-k8s-2/charm/venv/ops/pebble.py", line 1704, in _request_raw
    raise ConnectionError(
ops.pebble.ConnectionError: Could not connect to Pebble: socket not found at '/charm/containers/postgresql/pebble.socket' (container restarted?)
unit-postgresql-k8s-1: 17:36:20 INFO unit.postgresql-k8s/1.juju-log Updating Patroni config file
unit-postgresql-k8s-0: 17:36:20 INFO juju.worker.uniter.operation ran "stop" hook (via hook dispatching script: dispatch)
unit-postgresql-k8s-0: 17:36:20 INFO juju.worker.uniter unit "postgresql-k8s/0" shutting down: agent should be terminated
unit-postgresql-k8s-1: 17:36:20 INFO unit.postgresql-k8s/1.juju-log Updated health checks
unit-postgresql-k8s-2: 17:36:20 ERROR juju.worker.uniter.operation hook "stop" (via hook dispatching script: dispatch) failed: exit status 1
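
For context, the guard added in #448 made the handler return early when the Pebble socket is not reachable, instead of crashing as in the log above. A simplified sketch follows; the names are illustrative and this is not the charm's actual code.

# Simplified sketch (not the charm's actual code) of the kind of guard added
# in #448: return early when Pebble is unreachable instead of letting
# ops.pebble.ConnectionError crash the hook.
import ops


class PostgresqlOperatorCharm(ops.CharmBase):
    def __init__(self, framework: ops.Framework):
        super().__init__(framework)
        framework.observe(
            self.on.postgresql_pebble_ready, self._on_postgresql_pebble_ready
        )

    def _on_postgresql_pebble_ready(self, event: ops.PebbleReadyEvent) -> None:
        container = event.workload
        if not container.can_connect():
            # The container was likely restarted and Pebble is not back yet;
            # retry on a later event instead of raising.
            event.defer()
            return
        self._create_pgdata(container)  # rest of the original handler

    def _create_pgdata(self, container: ops.Container) -> None:
        ...  # omitted: creates the data directory inside the workload container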

Solution

Patch the pods in the postgresql-pebble-ready hook, which fires after the charm starts again, but only when the unit is already a member of the Patroni/PostgreSQL cluster (a sketch of the idea follows). An integration test that forces a failure in the stop hook was added to avoid regressions.
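
A hedged sketch of the idea behind the fix (the label keys and helper name are illustrative, not necessarily the charm's actual code): from the pebble-ready handler, re-apply the labels Patroni uses to recognise the pod, and do so only when Patroni already reports the unit as a cluster member, presumably so a pod that was never part of the cluster is not re-labelled by mistake.

# Illustrative sketch of re-applying the Patroni labels to the unit's pod
# (label keys and helper name are placeholders, not the charm's actual code).
from lightkube import Client
from lightkube.resources.core_v1 import Pod


def patch_pod_labels(pod_name: str, namespace: str, cluster_name: str) -> None:
    """Re-add the labels Patroni expects so the pod is seen as a cluster member."""
    client = Client()
    client.patch(
        Pod,
        pod_name,
        {"metadata": {"labels": {"application": "patroni", "cluster-name": cluster_name}}},
        namespace=namespace,
    )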

Some safeguards still need to be added to the stop hook, as other situations may still cause it to fail.

Fixes #433.

Signed-off-by: Marcelo Henrique Neppel <marcelo.neppel@canonical.com>

codecov bot commented Sep 3, 2024

Codecov Report

Attention: Patch coverage is 88.88889% with 2 lines in your changes missing coverage. Please review.

Project coverage is 70.72%. Comparing base (f203e4d) to head (16bd2b8).
Report is 2 commits behind head on main.

Files with missing lines | Patch % | Lines
src/charm.py             | 88.88%  | 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #680      +/-   ##
==========================================
- Coverage   70.75%   70.72%   -0.03%     
==========================================
  Files          11       11              
  Lines        2968     2972       +4     
  Branches      517      518       +1     
==========================================
+ Hits         2100     2102       +2     
- Misses        757      758       +1     
- Partials      111      112       +1     


marceloneppel changed the title from "[DPE-3887] Avoid replication slot deletion" to "(WIP) [DPE-3887] Avoid replication slot deletion" on Sep 12, 2024
marceloneppel marked this pull request as ready for review on September 16, 2024, 20:07
marceloneppel changed the title from "(WIP) [DPE-3887] Avoid replication slot deletion" to "[DPE-3887] Avoid replication slot deletion" on Sep 16, 2024
taurus-forever (Contributor) left a comment


Wow! This is an excellent finding! Thank you!

marceloneppel merged commit ea87ed7 into main on Sep 17, 2024
92 checks passed
marceloneppel deleted the dpe-3887-avoid-replication-slot-deletion branch on September 17, 2024, 14:27
Development

Successfully merging this pull request may close these issues:

Cluster lost files from data dir from the former primary and its replication slot
4 participants