[MISC] Fixes for PITR backup test instabilities #690

Merged
merged 21 commits into main from lucas/investigate-ci
Sep 23, 2024

Conversation

Member

@lucasgameiroborges lucasgameiroborges commented Sep 9, 2024

  • This PR addresses some of the sources of instability in CI, mainly those introduced with the new PITR feature.
  • Unit tests will be implemented once the changes are approved.
  • See review comments for specific context.


codecov bot commented Sep 9, 2024

Codecov Report

Attention: Patch coverage is 38.29787% with 29 lines in your changes missing coverage. Please review.

Project coverage is 70.65%. Comparing base (f203e4d) to head (a1728c9).
Report is 7 commits behind head on main.

Files with missing lines Patch % Lines
src/charm.py 40.00% 24 Missing and 3 partials ⚠️
src/upgrade.py 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #690      +/-   ##
==========================================
- Coverage   70.75%   70.65%   -0.10%     
==========================================
  Files          11       11              
  Lines        2968     2999      +31     
  Branches      517      523       +6     
==========================================
+ Hits         2100     2119      +19     
- Misses        757      767      +10     
- Partials      111      113       +2     


@lucasgameiroborges lucasgameiroborges marked this pull request as ready for review September 17, 2024 23:29
@lucasgameiroborges lucasgameiroborges changed the title from [MISC][WIP] try CI fixes to [MISC] Fixes for PITR backup test instabilities Sep 17, 2024
and len(services) > 0
and not self._was_restore_successful(container, services[0])
):
logger.debug("on_peer_relation_changed early exit: Backup restore check failed")
Member Author

@lucasgameiroborges lucasgameiroborges Sep 17, 2024

It was observed in some cases that, after a PITR restore failed, the first event to run after the failure would be on_peer_relation_changed instead of update_status (where the backup failure checks were located). When that happened, the charm would try to ping Patroni, fail, enter the "awaiting for member to start" status, and defer endlessly, never detecting that the backup had failed. This condition is meant to avoid that; see the sketch below.
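For context, a minimal sketch of the guard described above. The handler name, the "restoring-backup" flag, and the service lookup are assumptions; only the last two conditions and the debug message come from the diff fragment quoted in this thread.

```python
def _on_peer_relation_changed(self, event) -> None:
    container = self.unit.get_container("postgresql")  # assumed workload container name
    services = container.pebble.get_services(names=["postgresql"])  # assumed service name
    if (
        "restoring-backup" in self.app_peer_data  # assumed flag marking an in-progress restore
        and len(services) > 0
        and not self._was_restore_successful(container, services[0])
    ):
        # Catch the restore failure here instead of waiting for update_status,
        # so the charm does not ping an unresponsive Patroni and defer forever.
        logger.debug("on_peer_relation_changed early exit: Backup restore check failed")
        return
    # ... regular peer-relation-changed handling continues here ...
```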

if (
self.unit.status.message == MOVE_RESTORED_CLUSTER_TO_ANOTHER_BUCKET
and "require-change-bucket-after-restore" not in self.app_peer_data
):
Member Author

@lucasgameiroborges lucasgameiroborges Sep 17, 2024

This extra condition allows the MOVE_RESTORED_CLUSTER_TO_ANOTHER_BUCKET blocked status to resolve during the update_status event when the flag is no longer present, avoiding a potentially endless block. The whole check was moved to _on_update_status_early_exit_checks to reduce complexity inside _on_update_status.
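A minimal sketch of how that early-exit helper could resolve the status. The helper name comes from the comment above; resetting to ActiveStatus and the boolean return convention are assumptions.

```python
def _on_update_status_early_exit_checks(self, container) -> bool:
    if (
        self.unit.status.message == MOVE_RESTORED_CLUSTER_TO_ANOTHER_BUCKET
        and "require-change-bucket-after-restore" not in self.app_peer_data
    ):
        # The flag was removed from the peer data, so the blocked status no
        # longer applies and update_status can clear it.
        self.unit.status = ActiveStatus()
    # ... remaining early-exit checks; return False to stop _on_update_status early ...
    return True
```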

self.log_pitr_last_transaction_time()
self.unit.status = BlockedStatus(CANNOT_RESTORE_PITR)
return False
if "restore-to-time" in self.app_peer_data and all(self.is_pitr_failed(container)):
Member Author

Often, on Juju 3 specifically, the restore would fail (leaving Patroni unresponsive) while the service status itself stayed active for a while. This caused the charm to think Patroni simply was not ready yet on that unit, so it never caught the underlying failure. Depending on the ordering of the events that followed, the charm could end up in waiting status and never reach blocked.
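In other words, the log-based check has to run even while the Pebble service still reports active. A hedged sketch of that ordering: `is_pitr_failed` returning a pair of booleans is inferred from the `all(...)` call, and the rest mirrors the fragments quoted in this thread.

```python
if "restore-to-time" in self.app_peer_data and all(self.is_pitr_failed(container)):
    # The Patroni logs already show a PITR failure, even though the Pebble
    # service may keep reporting active for a while (observed on Juju 3).
    self.log_pitr_last_transaction_time()
    self.unit.status = BlockedStatus(CANNOT_RESTORE_PITR)
    return False
if service.current != ServiceStatus.ACTIVE and self.unit.status.message != CANNOT_RESTORE_PITR:
    # Only when the PITR check did not trigger does the generic
    # "service not active" handling apply.
    ...
```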

Member

Interesting findings!

)
patroni_exceptions = []
count = 0
while len(patroni_exceptions) == 0 and count < 10:
Member Author

This way of checking the logs for failures is not very stable and often misses the errors. My changes here only aim to decrease the likelihood of false negatives; integration tests can still fail on the first try, especially on Juju 2.9. We probably need to redo this check with another approach. A sketch of the polling idea follows.
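A rough sketch of the polling idea, not the charm's actual implementation: the log path, the error pattern, and the sleep interval are placeholders, and only the loop shape (retry up to 10 times while nothing was found) comes from the diff fragment above.

```python
import re
import time

def is_pitr_failed(self, container) -> tuple[bool, bool]:
    patroni_exceptions = []
    count = 0
    while len(patroni_exceptions) == 0 and count < 10:
        log_text = container.pull("/var/log/patroni/patroni.log").read()  # hypothetical path
        # Hypothetical pattern for a recovery-target failure in the logs.
        patroni_exceptions = re.findall(r"recovery ended before configured recovery target", log_text)
        count += 1
        time.sleep(3)  # assumed back-off between reads
    # The real method is consumed via `all(...)`, so it appears to return more
    # than one flag; this sketch simply returns the same result twice.
    failed = len(patroni_exceptions) > 0
    return failed, failed
```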

@@ -194,9 +194,15 @@ async def test_backup_aws(ops_test: OpsTest, cloud_configs: Tuple[Dict, Dict]) -
await ops_test.model.wait_for_idle(status="active", timeout=1000)

# Remove the database app.
- await ops_test.model.remove_application(database_app_name, block_until_done=True)
+ await ops_test.model.remove_application(database_app_name)
Member Author

This change (and the other similar ones) aims to avoid the case where, for some reason, the application removal gets stuck and never returns, causing the test to hang for 2 hours (until the CI job itself times out). Because remove_application() does not have a timeout parameter on Juju 2.9, I used a regular block_until with a timeout, as sketched below.
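A minimal sketch of that pattern, assuming pytest-operator's `ops_test` fixture; the timeout value is illustrative.

```python
# Trigger the removal, then wait with an explicit timeout instead of relying
# on block_until_done=True (which cannot time out on Juju 2.9).
await ops_test.model.remove_application(database_app_name)
await ops_test.model.block_until(
    lambda: database_app_name not in ops_test.model.applications,
    timeout=1000,  # assumed value; pick whatever fits the CI budget
)
```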

@@ -297,7 +309,7 @@ async def test_restore_on_new_cluster(ops_test: OpsTest, github_secrets) -> None
database_app_name,
0,
S3_INTEGRATOR_APP_NAME,
- MOVE_RESTORED_CLUSTER_TO_ANOTHER_BUCKET,
+ ANOTHER_CLUSTER_REPOSITORY_ERROR_MESSAGE,
Member Author

Reverting the change made in the PITR PR, which was not fully appropriate (possibly motivated by the issue observed in https://github.com/canonical/postgresql-k8s-operator/pull/690/files#r1764218285). Double-checked with @marceloneppel here.

apps=[DATABASE_APP_NAME],
status="active",
raise_on_blocked=True,
raise_on_error=False,
Member Author

raise_on_error=False was missing here; there is another one below. A hedged example of the full wait follows.
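For reference, a hedged example of the complete call with the flag in place; the timeout value is an assumption.

```python
await ops_test.model.wait_for_idle(
    apps=[DATABASE_APP_NAME],
    status="active",
    raise_on_blocked=True,
    raise_on_error=False,  # transient hook errors during restore should not fail the test
    timeout=1000,  # assumed value
)
```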

@@ -744,7 +744,7 @@ async def switchover(
candidate: The unit that should be elected the new primary.
"""
primary_ip = await get_unit_address(ops_test, current_primary)
- for attempt in Retrying(stop=stop_after_attempt(4), wait=wait_fixed(5), reraise=True):
+ for attempt in Retrying(stop=stop_after_attempt(60), wait=wait_fixed(3), reraise=True):
Member Author

Probably a bit overkill, but this was affecting test_backups, which is an annoying test to retry, and these values helped. A sketch of the retried call is below.
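For context, a sketch of what the retried switchover call might look like, assuming the helper POSTs to Patroni's /switchover endpoint on port 8008; the request body and the unit-name conversion are assumptions.

```python
import requests
from tenacity import Retrying, stop_after_attempt, wait_fixed

# Retry for up to 60 attempts, 3 seconds apart, so a temporarily unready
# Patroni (or a transient 412) does not immediately fail the test.
for attempt in Retrying(stop=stop_after_attempt(60), wait=wait_fixed(3), reraise=True):
    with attempt:
        response = requests.post(
            f"http://{primary_ip}:8008/switchover",
            json={
                "leader": current_primary.replace("/", "-"),  # assumed unit-name -> member-name mapping
                "candidate": candidate.replace("/", "-") if candidate else None,
            },
        )
        assert response.status_code == 200, f"switchover failed with code {response.status_code}"
```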

Member

Out of curiosity, do you have some logs from a failure related to the previous values?

Member Author

@lucasgameiroborges lucasgameiroborges Sep 19, 2024

There are two examples in this PR's commits:
https://github.com/canonical/postgresql-k8s-operator/actions/runs/10894200338/job/30260414050#step:31:395

https://github.com/canonical/postgresql-k8s-operator/actions/runs/10893225406/job/30227925207#step:31:491

Another case, for which I did not find a quick example but have seen, is when the assert for status 200 fails during retries because Patroni returns 412: no valid candidates for switchover.

Member

@marceloneppel marceloneppel left a comment

Amazing work, Lucas! Thanks for the improvements.

I left one request for a possible change in the way we check for a failure in PITR.

Comment on lines +1447 to +1450
if (
service.current != ServiceStatus.ACTIVE
and self.unit.status.message != CANNOT_RESTORE_PITR
):
Member

I suspect that this block is being reached earlier than the one above on Juju 2.9 because of the status message (Failed to restore backup) of the unit in the failed tests.

Member Author

That status is set here after the PITR error check, so I think this block runs before the one above because that check failed when it should have succeeded.

Anyway, I agree that this should be investigated in another PR; it probably won't be a quick fix, unfortunately.

Member

@marceloneppel marceloneppel left a comment

LGTM! Thanks a lot, Lucas!

Signed-off-by: Lucas Gameiro Borges <lucas.borges@canonical.com>
@Zvirovyi
Contributor

Great findings and improvements! Actually, this PR will fix #622

Contributor

@taurus-forever taurus-forever left a comment

Wow. This is excellent stabilization progress!

@lucasgameiroborges lucasgameiroborges merged commit 3d1b508 into main Sep 23, 2024
96 checks passed
@lucasgameiroborges lucasgameiroborges deleted the lucas/investigate-ci branch September 23, 2024 12:21