[DPE-4427] Address main instability sources on backups integration tests #496

lucasgameiroborges · 2024-06-04T20:12:07Z

Issue

The test_backups.py integration test has recently been one of the main sources of CI failures. This PR aims to stabilize it.

Solution

This PR addresses several root causes of the problem:

DPE-4427 hook failed: "certificates-relation-changed": The call to push_tls_files_to_workload() was failing with a transient connection error after multiple retries, but tenacity's RetryError was not among the caught exceptions, so the hook failed instead of deferring. RetryError is now caught and the event is deferred.
DPE-4425 charm stuck on scale-up: Caused by an infinite defer loop on the on_peer_relation_changed event because the cluster was not initialized. An extra check now initializes the cluster when running on the leader unit, which breaks the loop.
DPE-2107 Switchover on scale-down: If the primary unit is the one being removed in a scale-down, the charm doesn't recover. The fix implements a switchover in the storage_detaching hook.
[No ticket?] Add connection check after restart operation: implements a _can_connect_to_postgresql check that is verified in the restart hook.
[No ticket?] The update_read_only_endpoint call may fail with a permission-denied error when updating the databag/secret on a read-only cluster in the async_replication test. The error is now caught and logged, and the event proceeds.
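
For reference, the defer-instead-of-fail behaviour described for DPE-4427 can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not the charm's code: RetryError here is a local stand-in for tenacity's RetryError, and FakeEvent, push_with_retries and on_certificates_changed are hypothetical names used only for this example.

```python
class RetryError(Exception):
    """Local stand-in for tenacity.RetryError (assumption: keeps the sketch stdlib-only)."""


class FakeEvent:
    """Hypothetical minimal event that only records whether defer() was called."""

    def __init__(self):
        self.deferred = False

    def defer(self):
        self.deferred = True


def push_with_retries(push, attempts=3):
    """Call push() up to `attempts` times; raise RetryError if every attempt fails."""
    for _ in range(attempts):
        try:
            return push()
        except ConnectionError:
            # Transient failure: try again until the attempts run out.
            continue
    raise RetryError(f"push failed after {attempts} attempts")


def on_certificates_changed(event, push):
    """Return True if the files were pushed; defer the event instead of raising."""
    try:
        push_with_retries(push)
    except RetryError:
        # Without this clause the exception would propagate and fail the hook.
        event.defer()
        return False
    return True
```

The point of the sketch is only that the exception raised once the retries are exhausted has to be listed explicitly; catching the underlying connection error alone is not enough once a retry wrapper is involved.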

Related Follow-ups

test_backups.py: the charm may get stuck when TLS and S3-config events happen at the same time (?): https://warthogs.atlassian.net/browse/DPE-4596
test_charm.py: instability over the shared_buffers setting: https://warthogs.atlassian.net/browse/DPE-4594
test_tls.py: instability due to pg_rewind sometimes not using TLS: https://warthogs.atlassian.net/browse/DPE-4590

If you see an immediate solution for any of the issues mentioned, please let me know so I can test it and include it in this PR. Otherwise, we have tickets for further investigation.

lucasgameiroborges · 2024-06-10T14:10:27Z

tests/integration/test_backups.py

+        async with ops_test.fast_forward(fast_interval="60s"):
+            await ops_test.model.wait_for_idle(
+                apps=[database_app_name], status="active", timeout=1000
+            )


This was added to keep the charm from getting stuck due to https://warthogs.atlassian.net/browse/DPE-4596. Once the underlying issue is resolved, this change should be reverted, but we shouldn't let CI suffer in the meantime, IMO.

lucasgameiroborges · 2024-06-10T14:12:01Z

tests/integration/test_backups.py

-        # Run the "create backup" action.
+        # With a stable cluster, Run the "create backup" action
+        async with ops_test.fast_forward():
+            await ops_test.model.wait_for_idle(status="active", timeout=1000, idle_period=30)


This was added because, in rare cases, the charm was not idle/ready for a backup to be created.
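
The idea behind idle_period can be illustrated with a small polling loop: the status has to stay at the wanted value for the whole period, and any flap resets the timer. This is only a sketch under assumptions, not how wait_for_idle is implemented; wait_for_stable and its get_status, now and sleep parameters are hypothetical and injected so the example runs without a Juju model.

```python
def wait_for_stable(get_status, now, sleep, idle_period=30, timeout=1000, wanted="active"):
    """Return True once get_status() has equalled `wanted` for idle_period seconds in a row.

    `now` returns the current time in seconds and `sleep` pauses for the given
    number of seconds; both are injected so a fake clock can be used.
    """
    start = now()
    stable_since = None
    while now() - start <= timeout:
        if get_status() == wanted:
            if stable_since is None:
                # First observation of the wanted status: start the idle timer.
                stable_since = now()
            if now() - stable_since >= idle_period:
                return True
        else:
            # Any other status resets the idle timer.
            stable_since = None
        sleep(1)
    raise TimeoutError(f"status was not {wanted!r} for {idle_period}s within {timeout}s")
```

With idle_period=30, a unit that briefly reports active between two maintenance phases does not count as ready, which is the rare case the wait before the "create backup" action is meant to cover.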

lucasgameiroborges · 2024-06-10T19:45:43Z

For the record: this full CI retry was the first time I've seen our entire CI pass in one go!

taurus-forever

Let me trust CI/CD here. LGTM! Thank you!

marceloneppel

LGTM! Please just update the PostgreSQL TLS library LIBPATCH.

add postgres connection check

eeba554

github-actions bot added the Libraries: Out of sync label Jun 4, 2024

lucasgameiroborges added 9 commits June 4, 2024 22:24

add leader check in peer event + remove restart

6056279

remove unit test

375cf01

refactor initialization check

42d32f5

catch RetryError and defer tls event

e1bca0e

add primary switchover on scale-down

7d0d012

revert patroni restart change

27849d4

add sleep to tls check and catch modelError

97cbc63

wait before first backup + better logging on TLS test

7bac81d

add wait for model in TLS test + idle timeout

15c9b96

lucasgameiroborges changed the title from "add postgres connection check" to "[DPE-4427] Address main instability sources on backups integration tests" Jun 6, 2024

lucasgameiroborges added 11 commits June 6, 2024 23:16

catch retry error from patroni call + increase retries

dcb554a

revert retry attempt limit change

9885db7

revert wait model on tls test

f0f9c0e

update test_charm parameter

d174289

refactor on initialize cluster check + make self_healing fail fast

911dc24

Merge remote-tracking branch 'origin/main' into lucas/stabilize-test

b8a63a7

add shared buffers to dynamic config

4b0e22c

revert + make wait for TLS in test_backups

6150258

clean previous changes

1767c71

try waiting for idle after relating

c3dda0f

reposition wait to avoid charm stuck

9691812

lucasgameiroborges commented Jun 10, 2024

dragomirp reviewed Jun 10, 2024

tests/integration/test_charm.py (resolved)

dragomirp reviewed Jun 10, 2024

src/charm.py (outdated, resolved)

lucasgameiroborges marked this pull request as ready for review June 10, 2024 16:53

lucasgameiroborges requested review from marceloneppel and taurus-forever June 10, 2024 16:58

taurus-forever approved these changes Jun 11, 2024

src/charm.py (outdated, resolved)

lucasgameiroborges requested a review from dragomirp June 11, 2024 14:58

marceloneppel approved these changes Jun 11, 2024

lib/charms/postgresql_k8s/v0/postgresql_tls.py (resolved)

src/charm.py (outdated, resolved)

tests/integration/test_charm.py (resolved)

bump LIBPATCH and set log to warning level

4eb8ea7

marceloneppel approved these changes Jun 11, 2024

dragomirp approved these changes Jun 11, 2024

lucasgameiroborges merged commit c996dd3 into main Jun 11, 2024
46 checks passed

lucasgameiroborges deleted the lucas/stabilize-test branch June 11, 2024 23:36
