[MISC] Fixes for PITR backup test instabilities #690
Conversation
lucasgameiroborges commented on Sep 9, 2024 (edited)
- This PR addresses some of the sources of instability in CI, mainly those introduced with the new PITR feature.
- Unit testing will be implemented once you are OK with the changes.
- See review comments for specific context.
Codecov Report
Attention: Patch coverage is …
Additional details and impacted files:

@@ Coverage Diff @@
##             main     #690      +/-   ##
==========================================
- Coverage   70.75%   70.65%   -0.10%
==========================================
  Files          11       11
  Lines        2968     2999      +31
  Branches      517      523       +6
==========================================
+ Hits         2100     2119      +19
- Misses        757      767      +10
- Partials      111      113       +2

☔ View full report in Codecov by Sentry.
        and len(services) > 0
        and not self._was_restore_successful(container, services[0])
    ):
        logger.debug("on_peer_relation_changed early exit: Backup restore check failed")
It was observed in some cases that, after a PITR restore failed, the first event to run following the failure would be on_peer_relation_changed instead of update_status (where the backup failure checks were located). When that happened, the charm would try to ping Patroni, fail, enter the "awaiting for member to start" status, and defer endlessly, never knowing it was a backup failure. This condition is meant to avoid that.
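For context, a minimal sketch of how this guard could sit inside the handler. Only the quoted condition comes from the diff; the container name, the restoring-backup flag, and the surrounding handler shape are assumptions:

```python
import logging

logger = logging.getLogger(__name__)

# Hedged sketch, not the charm's exact code: early exit in
# on_peer_relation_changed when a backup restore has already failed.
def _on_peer_relation_changed(self, event) -> None:
    container = self.unit.get_container("postgresql")  # assumed container name
    services = container.pebble.get_services(names=[self._postgresql_service])
    if (
        "restoring-backup" in self.app_peer_data  # assumed flag name
        and len(services) > 0
        and not self._was_restore_successful(container, services[0])
    ):
        logger.debug("on_peer_relation_changed early exit: Backup restore check failed")
        return  # avoid pinging Patroni and deferring endlessly
    ...
```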
    if (
        self.unit.status.message == MOVE_RESTORED_CLUSTER_TO_ANOTHER_BUCKET
        and "require-change-bucket-after-restore" not in self.app_peer_data
    ):
This extra condition is meant to allow the MOVE_RESTORED_CLUSTER_TO_ANOTHER_BUCKET blocked status to resolve inside the update_status event when the flag is no longer there, avoiding a potential endless block. The whole check was moved to _on_update_status_early_exit_checks to avoid complexity inside _on_update_status.
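A hedged sketch of that resolution path; only the quoted condition is from the diff, and the method body around it (including how the blocked status is lifted) is an assumption:

```python
from ops.model import ActiveStatus

# Hedged sketch: lift the blocked status inside update_status once the
# require-change-bucket-after-restore flag has been cleared.
def _on_update_status_early_exit_checks(self, container) -> bool:
    if (
        self.unit.status.message == MOVE_RESTORED_CLUSTER_TO_ANOTHER_BUCKET
        and "require-change-bucket-after-restore" not in self.app_peer_data
    ):
        # The flag is gone (e.g. the cluster now points at a new bucket),
        # so the blocked status can resolve instead of persisting forever.
        self.unit.status = ActiveStatus()
        return True
    ...
```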
        self.log_pitr_last_transaction_time()
        self.unit.status = BlockedStatus(CANNOT_RESTORE_PITR)
        return False
    if "restore-to-time" in self.app_peer_data and all(self.is_pitr_failed(container)):
Often, on Juju 3 specifically, the restore would fail (making Patroni unresponsive) but the service status itself would stay active for a while. This caused the charm to think Patroni simply wasn't ready yet on the unit, and it never caught the underlying failure. Depending on the ordering of subsequent events, this could leave the charm stuck in waiting status and never reach blocked.
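To illustrate that failure mode, a sketch under the assumption that the Pebble service state lags behind Patroni's actual health; names mirror the quoted diffs:

```python
from ops.pebble import ServiceStatus
from ops.model import BlockedStatus

# Hedged illustration: on Juju 3 the Pebble service may still report ACTIVE
# shortly after a failed PITR restore, so service state alone is not enough.
service = container.pebble.get_services(names=[self._postgresql_service])[0]
if service.current == ServiceStatus.ACTIVE:
    # ACTIVE can be stale here; still check the PITR state explicitly.
    if "restore-to-time" in self.app_peer_data and all(self.is_pitr_failed(container)):
        self.unit.status = BlockedStatus(CANNOT_RESTORE_PITR)
```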
Interesting findings!
    )
    patroni_exceptions = []
    count = 0
    while len(patroni_exceptions) == 0 and count < 10:
This way of checking the logs for failures is not very stable and often misses the errors. My changes here aim only to decrease the likelihood of false negatives; integration tests can still fail on the first try, especially on Juju 2.9. We probably need to redo this check using another approach.
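A minimal sketch of the polling pattern being discussed; the log path, the error pattern, and the bounds are illustrative assumptions rather than the PR's exact code:

```python
import re
import time

PATRONI_LOG_PATH = "/var/log/postgresql/patroni.log"  # assumed location

def find_pitr_errors(container, attempts: int = 10, delay: float = 3.0) -> list:
    """Poll the Patroni log for restore errors; the relevant lines can show
    up several seconds after the restore actually fails, hence the retries."""
    patroni_exceptions = []
    count = 0
    while len(patroni_exceptions) == 0 and count < attempts:
        logs = container.pull(PATRONI_LOG_PATH).read()
        # Assumed error marker; the real check may match different patterns.
        patroni_exceptions = re.findall(r"^.*FATAL:.*$", logs, re.MULTILINE)
        count += 1
        time.sleep(delay)
    return patroni_exceptions
```

Even with more attempts, grepping logs this way stays racy, which is why the comment suggests redoing the check with another approach.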
@@ -194,9 +194,15 @@ async def test_backup_aws(ops_test: OpsTest, cloud_configs: Tuple[Dict, Dict]) -
     await ops_test.model.wait_for_idle(status="active", timeout=1000)

     # Remove the database app.
-    await ops_test.model.remove_application(database_app_name, block_until_done=True)
+    await ops_test.model.remove_application(database_app_name)
This change (and the other similar ones) aims to avoid the case where, for some reason, the application removal gets stuck and never returns, causing the test to hang for two hours (until the CI job itself times out). Because remove_application() does not have a timeout parameter on Juju 2.9, I used the regular block_until with a timeout.
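A minimal sketch of that pattern, using python-libjuju's Model.block_until with an explicit bound (the timeout value is an assumption):

```python
# Remove the app without relying on remove_application's own blocking
# (no timeout parameter on Juju 2.9), then bound the wait ourselves.
await ops_test.model.remove_application(database_app_name)
await ops_test.model.block_until(
    lambda: database_app_name not in ops_test.model.applications,
    timeout=1000,  # assumed bound; fail the test instead of hanging for 2 h
)
```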
@@ -297,7 +309,7 @@ async def test_restore_on_new_cluster(ops_test: OpsTest, github_secrets) -> None
     database_app_name,
     0,
     S3_INTEGRATOR_APP_NAME,
-    MOVE_RESTORED_CLUSTER_TO_ANOTHER_BUCKET,
+    ANOTHER_CLUSTER_REPOSITORY_ERROR_MESSAGE,
Reverting the change made in the PITR PR, which was not fully appropriate (possibly motivated by the issue observed in https://github.com/canonical/postgresql-k8s-operator/pull/690/files#r1764218285). Double-checked with @marceloneppel here.
        apps=[DATABASE_APP_NAME],
        status="active",
        raise_on_blocked=True,
        raise_on_error=False,
raise_on_error=False was missing here; there is another one below.
@@ -744,7 +744,7 @@ async def switchover(
     candidate: The unit that should be elected the new primary.
     """
     primary_ip = await get_unit_address(ops_test, current_primary)
-    for attempt in Retrying(stop=stop_after_attempt(4), wait=wait_fixed(5), reraise=True):
+    for attempt in Retrying(stop=stop_after_attempt(60), wait=wait_fixed(3), reraise=True):
Probably a bit overkill, but this was affecting test_backups, which is an annoying test to retry, and these values helped.
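For reference, a hedged sketch of the retried switchover these values feed into. The endpoint and payload follow Patroni's public REST API; the helper's exact body is an assumption:

```python
import requests
from tenacity import Retrying, stop_after_attempt, wait_fixed

for attempt in Retrying(stop=stop_after_attempt(60), wait=wait_fixed(3), reraise=True):
    with attempt:
        response = requests.post(
            f"http://{primary_ip}:8008/switchover",
            # Assumed payload shape: Patroni expects the current leader and
            # the candidate member names.
            json={"leader": current_primary, "candidate": candidate},
        )
        # Patroni can transiently answer 412 ("no valid candidates for
        # switchover"), so keep retrying until it accepts with 200.
        assert response.status_code == 200
```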
Out of curiosity, do you have some logs from a failure related to the previous values?
In this PR's commits there are 2 examples:
https://github.com/canonical/postgresql-k8s-operator/actions/runs/10894200338/job/30260414050#step:31:395
Another case, for which I didn't find a quick example but have seen, is when the assert for status 200 fails during retries because Patroni returns 412: no valid candidates for switchover.
Amazing work, Lucas! Thanks for the improvements.
I left one request for a possible change in the way we check for a failure in PITR.
    if (
        service.current != ServiceStatus.ACTIVE
        and self.unit.status.message != CANNOT_RESTORE_PITR
    ):
I suspect that this block is being reached earlier than the one above on Juju 2.9 because of the unit's status message (Failed to restore backup) in the failed tests:
- https://github.com/canonical/postgresql-k8s-operator/actions/runs/10911425853/job/30284387686#step:31:699
- https://github.com/canonical/postgresql-k8s-operator/actions/runs/10911425853/job/30284388001#step:31:214
As I commented before on MM, the fix for that may be investigated later in another PR.
That status is set here after the PITR error check, so I think the reason this block runs before the one above is that the check failed when it should've succeeded.
Anyway, I agree that this should be investigated in another PR; it probably won't be a quick fix, unfortunately.
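Putting the two quoted fragments side by side makes the ordering argument concrete. A hedged reconstruction, with the glue between the quoted fragments assumed:

```python
from ops.model import BlockedStatus
from ops.pebble import ServiceStatus

def _was_restore_successful(self, container, service) -> bool:
    # 1) The PITR-specific failure check runs first.
    if "restore-to-time" in self.app_peer_data and all(self.is_pitr_failed(container)):
        self.log_pitr_last_transaction_time()
        self.unit.status = BlockedStatus(CANNOT_RESTORE_PITR)
        return False
    # 2) Generic service check; "Failed to restore backup" is only set after
    #    the PITR check, so reaching it implies the check above missed.
    if (
        service.current != ServiceStatus.ACTIVE
        and self.unit.status.message != CANNOT_RESTORE_PITR
    ):
        self.unit.status = BlockedStatus("Failed to restore backup")
        return False
    return True
```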
LGTM! Thanks a lot, Lucas!
Signed-off-by: Lucas Gameiro Borges <lucas.borges@canonical.com>
Great findings and improvements! Actually, this PR will fix #622
Wow. This is excellent stabilization progress!