[MISC] Fixes for PITR backup test instabilities #690

Merged
merged 21 commits into main from lucas/investigate-ci
Sep 23, 2024

Conversation

Member

@lucasgameiroborges lucasgameiroborges commented Sep 9, 2024

  • This PR addresses some of the sources of instability in CI, mainly those introduced with the new PITR feature.
  • Unit tests will be implemented once the changes are approved.
  • See review comments for specific context.


codecov bot commented Sep 9, 2024

Codecov Report

Attention: Patch coverage is 38.29787% with 29 lines in your changes missing coverage. Please review.

Project coverage is 70.65%. Comparing base (f203e4d) to head (a1728c9).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
src/charm.py 40.00% 24 Missing and 3 partials ⚠️
src/upgrade.py 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #690      +/-   ##
==========================================
- Coverage   70.75%   70.65%   -0.10%     
==========================================
  Files          11       11              
  Lines        2968     2999      +31     
  Branches      517      523       +6     
==========================================
+ Hits         2100     2119      +19     
- Misses        757      767      +10     
- Partials      111      113       +2     


@lucasgameiroborges lucasgameiroborges marked this pull request as ready for review September 17, 2024 23:29
@lucasgameiroborges lucasgameiroborges changed the title from [MISC][WIP] try CI fixes to [MISC] Fixes for PITR backup test instabilities Sep 17, 2024
and len(services) > 0
and not self._was_restore_successful(container, services[0])
):
logger.debug("on_peer_relation_changed early exit: Backup restore check failed")
Member Author

@lucasgameiroborges lucasgameiroborges Sep 17, 2024

It was observed in some cases that, after a PITR restore failed, the first event to run after the failure would be on_peer_relation_changed instead of update_status (where the backup failure checks were located). When that happened, the charm would try to ping Patroni, fail, enter the "awaiting for member to start" status, and defer endlessly, never detecting that the backup had failed. This condition is meant to avoid that; see the sketch below.
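For context, a minimal sketch of the guard described above. The handler name, the "restoring-backup" flag, and the service lookup are assumptions; only the last two conditions and the debug message come from the diff fragment quoted in this thread.

```python
def _on_peer_relation_changed(self, event) -> None:
    container = self.unit.get_container("postgresql")  # assumed workload container name
    services = container.pebble.get_services(names=["postgresql"])  # assumed service name
    if (
        "restoring-backup" in self.app_peer_data  # assumed flag marking an in-progress restore
        and len(services) > 0
        and not self._was_restore_successful(container, services[0])
    ):
        # Catch the restore failure here instead of waiting for update_status,
        # so the charm does not ping an unresponsive Patroni and defer forever.
        logger.debug("on_peer_relation_changed early exit: Backup restore check failed")
        return
    # ... regular peer-relation-changed handling continues here ...
```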

if (
self.unit.status.message == MOVE_RESTORED_CLUSTER_TO_ANOTHER_BUCKET
and "require-change-bucket-after-restore" not in self.app_peer_data
):
Member Author

@lucasgameiroborges lucasgameiroborges Sep 17, 2024

This extra condition allows the MOVE_RESTORED_CLUSTER_TO_ANOTHER_BUCKET blocked status to resolve during the update_status event when the flag is no longer present, avoiding a potentially endless block. The whole check was moved to _on_update_status_early_exit_checks to reduce complexity inside _on_update_status.
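A minimal sketch of how that early-exit helper could resolve the status. The helper name comes from the comment above; resetting to ActiveStatus and the boolean return convention are assumptions.

```python
def _on_update_status_early_exit_checks(self, container) -> bool:
    if (
        self.unit.status.message == MOVE_RESTORED_CLUSTER_TO_ANOTHER_BUCKET
        and "require-change-bucket-after-restore" not in self.app_peer_data
    ):
        # The flag was removed from the peer data, so the blocked status no
        # longer applies and update_status can clear it.
        self.unit.status = ActiveStatus()
    # ... remaining early-exit checks; return False to stop _on_update_status early ...
    return True
```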

self.log_pitr_last_transaction_time()
self.unit.status = BlockedStatus(CANNOT_RESTORE_PITR)
return False
if "restore-to-time" in self.app_peer_data and all(self.is_pitr_failed(container)):
Member Author

Often, on Juju 3 specifically, the restore would fail (leaving Patroni unresponsive) while the service status itself stayed active for a while. This caused the charm to think Patroni simply was not ready yet on that unit, so it never caught the underlying failure. Depending on the ordering of the events that followed, the charm could end up in waiting status and never reach blocked.
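In other words, the log-based check has to run even while the Pebble service still reports active. A hedged sketch of that ordering: `is_pitr_failed` returning a pair of booleans is inferred from the `all(...)` call, and the rest mirrors the fragments quoted in this thread.

```python
if "restore-to-time" in self.app_peer_data and all(self.is_pitr_failed(container)):
    # The Patroni logs already show a PITR failure, even though the Pebble
    # service may keep reporting active for a while (observed on Juju 3).
    self.log_pitr_last_transaction_time()
    self.unit.status = BlockedStatus(CANNOT_RESTORE_PITR)
    return False
if service.current != ServiceStatus.ACTIVE and self.unit.status.message != CANNOT_RESTORE_PITR:
    # Only when the PITR check did not trigger does the generic
    # "service not active" handling apply.
    ...
```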

Member

Interesting findings!

)
patroni_exceptions = []
count = 0
while len(patroni_exceptions) == 0 and count < 10:
Member Author

This way of checking the logs for failures is not very stable and often misses the errors. My changes here only aim to decrease the likelihood of false negatives; integration tests can still fail on the first try, especially on Juju 2.9. We probably need to redo this check with another approach. A sketch of the polling idea follows.
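A rough sketch of the polling idea, not the charm's actual implementation: the log path, the error pattern, and the sleep interval are placeholders, and only the loop shape (retry up to 10 times while nothing was found) comes from the diff fragment above.

```python
import re
import time

def is_pitr_failed(self, container) -> tuple[bool, bool]:
    patroni_exceptions = []
    count = 0
    while len(patroni_exceptions) == 0 and count < 10:
        log_text = container.pull("/var/log/patroni/patroni.log").read()  # hypothetical path
        # Hypothetical pattern for a recovery-target failure in the logs.
        patroni_exceptions = re.findall(r"recovery ended before configured recovery target", log_text)
        count += 1
        time.sleep(3)  # assumed back-off between reads
    # The real method is consumed via `all(...)`, so it appears to return more
    # than one flag; this sketch simply returns the same result twice.
    failed = len(patroni_exceptions) > 0
    return failed, failed
```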

@@ -194,9 +194,15 @@ async def test_backup_aws(ops_test: OpsTest, cloud_configs: Tuple[Dict, Dict]) -
await ops_test.model.wait_for_idle(status="active", timeout=1000)

# Remove the database app.
- await ops_test.model.remove_application(database_app_name, block_until_done=True)
+ await ops_test.model.remove_application(database_app_name)
Member Author

This change (and the other similar ones) aims to avoid the case where, for some reason, the application removal gets stuck and never returns, causing the test to hang for 2 hours (until the CI job itself times out). Because remove_application() does not have a timeout parameter on Juju 2.9, I used a regular block_until with a timeout, as sketched below.
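A minimal sketch of that pattern, assuming pytest-operator's `ops_test` fixture; the timeout value is illustrative.

```python
# Trigger the removal, then wait with an explicit timeout instead of relying
# on block_until_done=True (which cannot time out on Juju 2.9).
await ops_test.model.remove_application(database_app_name)
await ops_test.model.block_until(
    lambda: database_app_name not in ops_test.model.applications,
    timeout=1000,  # assumed value; pick whatever fits the CI budget
)
```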

@@ -297,7 +309,7 @@ async def test_restore_on_new_cluster(ops_test: OpsTest, github_secrets) -> None
database_app_name,
0,
S3_INTEGRATOR_APP_NAME,
- MOVE_RESTORED_CLUSTER_TO_ANOTHER_BUCKET,
+ ANOTHER_CLUSTER_REPOSITORY_ERROR_MESSAGE,
Member Author

Reverting the change made in the PITR PR, which was not fully appropriate (possibly motivated by the issue observed in https://github.com/canonical/postgresql-k8s-operator/pull/690/files#r1764218285). Double-checked with @marceloneppel here.

apps=[DATABASE_APP_NAME],
status="active",
raise_on_blocked=True,
raise_on_error=False,
Member Author

raise_on_error=False was missing here; there is another one below. A hedged example of the full wait follows.
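For reference, a hedged example of the complete call with the flag in place; the timeout value is an assumption.

```python
await ops_test.model.wait_for_idle(
    apps=[DATABASE_APP_NAME],
    status="active",
    raise_on_blocked=True,
    raise_on_error=False,  # transient hook errors during restore should not fail the test
    timeout=1000,  # assumed value
)
```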

@@ -744,7 +744,7 @@ async def switchover(
candidate: The unit that should be elected the new primary.
"""
primary_ip = await get_unit_address(ops_test, current_primary)
- for attempt in Retrying(stop=stop_after_attempt(4), wait=wait_fixed(5), reraise=True):
+ for attempt in Retrying(stop=stop_after_attempt(60), wait=wait_fixed(3), reraise=True):
Member Author

Probably a bit overkill, but this was affecting test_backups, which is an annoying test to retry, and these values helped. A sketch of the retried call is below.
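For context, a sketch of what the retried switchover call might look like, assuming the helper POSTs to Patroni's /switchover endpoint on port 8008; the request body and the unit-name conversion are assumptions.

```python
import requests
from tenacity import Retrying, stop_after_attempt, wait_fixed

# Retry for up to 60 attempts, 3 seconds apart, so a temporarily unready
# Patroni (or a transient 412) does not immediately fail the test.
for attempt in Retrying(stop=stop_after_attempt(60), wait=wait_fixed(3), reraise=True):
    with attempt:
        response = requests.post(
            f"http://{primary_ip}:8008/switchover",
            json={
                "leader": current_primary.replace("/", "-"),  # assumed unit-name -> member-name mapping
                "candidate": candidate.replace("/", "-") if candidate else None,
            },
        )
        assert response.status_code == 200, f"switchover failed with code {response.status_code}"
```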

Member

Out of curiosity, do you have some logs from a failure related to the previous values?

Member Author

@lucasgameiroborges lucasgameiroborges Sep 19, 2024

There are two examples in this PR's commits:
https://github.com/canonical/postgresql-k8s-operator/actions/runs/10894200338/job/30260414050#step:31:395

https://github.com/canonical/postgresql-k8s-operator/actions/runs/10893225406/job/30227925207#step:31:491

Another case, for which I did not find a quick example but have seen, is when the assert for status 200 fails during retries because Patroni returns 412: no valid candidates for switchover.

Member

@marceloneppel marceloneppel left a comment

Amazing work, Lucas! Thanks for the improvements.

I left one request for a possible change in the way we check for a failure in PITR.

Comment on lines +1447 to +1450
if (
service.current != ServiceStatus.ACTIVE
and self.unit.status.message != CANNOT_RESTORE_PITR
):
Member

I suspect that this block is being reached earlier than the one above on Juju 2.9 because of the status message (Failed to restore backup) of the unit in the failed tests.

Member Author

That status is set here after the PITR error check, so I think this block runs before the one above because that check failed when it should have succeeded.

Anyway, I agree that this should be investigated in another PR; it probably won't be a quick fix, unfortunately.

Member

@marceloneppel marceloneppel left a comment

LGTM! Thanks a lot, Lucas!

Signed-off-by: Lucas Gameiro Borges <lucas.borges@canonical.com>
@Zvirovyi
Contributor

Great findings and improvements! Actually, this PR will fix #622

Contributor

@taurus-forever taurus-forever left a comment

Wow. This is excellent stabilization progress!

@lucasgameiroborges lucasgameiroborges merged commit 3d1b508 into main Sep 23, 2024
96 checks passed
@lucasgameiroborges lucasgameiroborges deleted the lucas/investigate-ci branch September 23, 2024 12:21