[DPE-4427] Address main instability sources on backups integration tests #496
async with ops_test.fast_forward(fast_interval="60s"):
    await ops_test.model.wait_for_idle(
        apps=[database_app_name], status="active", timeout=1000
    )
This was added to avoid the charm getting stuck due to https://warthogs.atlassian.net/browse/DPE-4596. Once the underlying issue is resolved, this change should be reverted, but we shouldn't let CI suffer in the meantime, IMO.
# Run the "create backup" action.
# With a stable cluster, Run the "create backup" action
async with ops_test.fast_forward():
    await ops_test.model.wait_for_idle(status="active", timeout=1000, idle_period=30)
This was added because, in rare cases, the charm was not idle/ready for a backup to be created.
For the record: this full CI retry was the first time I've seen our entire CI pass in one go!
Let me trust CI/CD here. LGTM! Thank you!
LGTM! Please just update the PostgreSQL TLS library LIBPATCH.
Issue
The `test_backups.py` integration test has been one of the main sources of CI failure in the recent past. This PR aims to help stabilize the test.

Solution
This PR addresses a number of root causes of the problem, namely:
- DPE-4427 hook failed: "certificates-relation-changed": The function call `push_tls_files_to_workload()` was failing with a transient connection error after multiple retries, but tenacity's `RetryError` was not one of the caught exceptions, thus making the hook fail instead of deferring.
- DPE-4425 charm stuck on scale-up: Caused by an infinite defer loop on the `on_peer_relation_changed` event because the cluster was not initialized. Added an extra check to initialize the cluster if on the leader unit, which breaks the loop.
- DPE-2107 Switchover on scale-down: If the primary unit is the one being removed in a scale-down, the charm doesn't recover ==> implement a switchover on the `storage_detaching` hook.
- [No ticket?] Add connection check after restart operation: implements a `_can_connect_to_postgresql` check that is verified on the restart hook.
- [No ticket?] `update_read_only_endpoint` call may fail due to permission denied when updating databag/secret in the case of a read-only cluster in the async_replication test ==> catch and log the error and move forward with the event.

Related Follow-ups
- `test_backups.py` charm may get stuck when TLS and S3-config events happen at the same time (?): https://warthogs.atlassian.net/browse/DPE-4596
- `test_charm.py` instability over the `shared_buffers` setting: https://warthogs.atlassian.net/browse/DPE-4594
- `test_tls.py` instability due to `pg_rewind` sometimes not using TLS: https://warthogs.atlassian.net/browse/DPE-4590

If you see an immediate solution for the issues mentioned, please let me know so I can test it and include it in the PR. Otherwise, we have tickets for further investigation.
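For readers unfamiliar with the DPE-4427 failure mode, here is a minimal sketch of the pattern described above: when retries are exhausted, catch the retry-exhausted exception and defer the event rather than letting the hook fail. This is not the charm's actual code. `RetryError` below is a local stand-in for tenacity's exception, and `retry`, `FakeEvent`, `on_certificates_relation_changed`, and `always_unreachable` are hypothetical names used only so the example runs with the standard library.

```python
import logging

logger = logging.getLogger(__name__)


class RetryError(Exception):
    """Local stand-in for tenacity's RetryError (raised once retries are exhausted)."""


def retry(func, attempts=3):
    """Call func up to `attempts` times; raise RetryError if every attempt fails."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError as error:  # transient failure, so try again
            last_error = error
    raise RetryError(f"gave up after {attempts} attempts") from last_error


class FakeEvent:
    """Hypothetical stand-in for a charm event; it only records a defer() call."""

    def __init__(self):
        self.deferred = False

    def defer(self):
        self.deferred = True


def on_certificates_relation_changed(event, push_tls_files):
    """Push TLS files; on retry exhaustion, defer the event instead of raising."""
    try:
        retry(push_tls_files)
    except RetryError:
        # Without this clause the exception would propagate and the hook would fail.
        logger.debug("Cannot push TLS files yet, deferring the event")
        event.defer()
        return False
    return True


def always_unreachable():
    """Simulate a workload that never becomes reachable."""
    raise ConnectionError("workload not reachable")


event = FakeEvent()
handled = on_certificates_relation_changed(event, always_unreachable)
print(handled, event.deferred)
```

In this sketch the handler reports that it did not finish and marks the event as deferred, so the work can be retried on a later dispatch instead of leaving the unit in an error state.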