Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DPE-4427] Address main instability sources on backups integration tests #496

Merged
merged 22 commits into from
Jun 11, 2024

Conversation

lucasgameiroborges
Copy link
Member

@lucasgameiroborges lucasgameiroborges commented Jun 4, 2024

Issue

The test_backups.py integration test has been one of the main sources of CI failure in recent past. This PR aims to help stabilize the test.

Solution

This PR addresses a number of root causes to the problem, namely:

  • DPE-4427 hook failed: "certificates-relation-changed": The function call push_tls_files_to_workload() was failing with a transient connection error after multiple retries, but tenacity's RetryError was not one of the catched exceptions, thus making the hook fail instead of deferring.

  • DPE-4425 charm stuck on scale-up: Caused by an infinite defer loop on on_peer_relation_changed event because cluster was not initialized. Added an extra check to initialize cluster if on leader unit, which breaks the loop.

  • DPE-2107 Switchover on scale-down: If the primary unit is the one being removed in a scale-down, the charm doesn't recover ==> implement a switchover on storage_detaching hook.

  • [No ticket?] Add connection check after restart operation: implements a _can_connect_to_postgresql check that is verified on restart hook.

  • [No ticket?] update_read_only_endpoint call may fail due to permission denied when updating databag/secret in the case of read-only cluster in the async_replication test ==> catch and log the error and move forward with the event.

Related Follow-ups

If you see an immediate solution for the issues mentioned, please let me know so I can test it and include in the PR. Otherwise, we have tickets for further investigation.

@lucasgameiroborges lucasgameiroborges changed the title add postgres connection check [DPE-4427] Address main instability sources on backups integration tests Jun 6, 2024
async with ops_test.fast_forward(fast_interval="60s"):
await ops_test.model.wait_for_idle(
apps=[database_app_name], status="active", timeout=1000
)
Copy link
Member Author

@lucasgameiroborges lucasgameiroborges Jun 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added in order to avoid the charm getting stuck due to https://warthogs.atlassian.net/browse/DPE-4596. Once underlying issue is resolved, this change should get reverted, but we shouldn't let CI suffer in the meantime IMO

# Run the "create backup" action.
# With a stable cluster, Run the "create backup" action
async with ops_test.fast_forward():
await ops_test.model.wait_for_idle(status="active", timeout=1000, idle_period=30)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added because, in rare cases, the charm was not idle/ready for a backup to be created.

src/charm.py Outdated Show resolved Hide resolved
@lucasgameiroborges lucasgameiroborges marked this pull request as ready for review June 10, 2024 16:53
@lucasgameiroborges
Copy link
Member Author

For the record: this full CI retry was the first time I've seen our entire CI pass in one go!

Copy link
Contributor

@taurus-forever taurus-forever left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me trust CI/CD here. LGTM! Thank you!

src/charm.py Outdated Show resolved Hide resolved
Copy link
Member

@marceloneppel marceloneppel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Please just update the PostgreSQL TLS library LIBPATCH.

src/charm.py Outdated Show resolved Hide resolved
tests/integration/test_charm.py Show resolved Hide resolved
@lucasgameiroborges lucasgameiroborges merged commit c996dd3 into main Jun 11, 2024
46 checks passed
@lucasgameiroborges lucasgameiroborges deleted the lucas/stabilize-test branch June 11, 2024 23:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants