Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Instance: Fix containers not always starting up after host reboot #13700

Merged
merged 3 commits into from
Jul 4, 2024

Conversation

tomponline
Copy link
Member

@tomponline tomponline commented Jul 4, 2024

Ignore liblxc.ErrNotRunning error when stopping.

If the container refuses to stop, then check if the error is ErrNotRunning, and if so ignore it, because sometimes if an earlier shutdown request was sent, but timed out, the actual guest shutdown can still be proceeding and the container may have reached a stop state by now and is in the process of running the onStop hook to cleanup host side devices. If we returned here with ErrNotRunning then this would be incorrect as the onStop hook could still be running and we aren't fully cleaned up yet, which can cause issues with state reporting after Stop has returned.

…p after host reboot

Ignore liblxc.ErrNotRunning error when stopping.

If the container refuses to stop, then check if the error is ErrNotRunning, and if so ignore it, because
sometimes if an earlier shutdown request was sent, but timed out, the actual guest shutdown can still be
proceeding and the container may have reached a stop state by now and is in the process of running the
onStop hook to cleanup host side devices. If we returned here with ErrNotRunning then this would be
incorrect as the onStop hook could still be running and we aren't fully cleaned up yet, which can cause
issues with state reporting after Stop has returned.

Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
@tomponline tomponline self-assigned this Jul 4, 2024
@tomponline tomponline marked this pull request as ready for review July 4, 2024 09:06
Copy link
Member

@MusicDin MusicDin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cluster tests are failing

@tomponline
Copy link
Member Author

Thanks @MusicDin looks like an issue when the cluster DB is offline and we try and shutdown the last LXD member:

time="2024-07-04T08:43:07Z" level=info msg="Stopping instances" stopPriority=0
time="2024-07-04T08:43:07Z" level=info msg="Shutting down instance" action=shutdown created="0001-01-01 00:00:00 +0000 UTC" ephemeral=false instance=foo instanceType=container project=default timeout=30s used="0001-01-01 00:00:00 +0000 UTC"
time="2024-07-04T08:43:09Z" level=warning msg="Failed loading instance from database, trying backup file" err="Failed to begin transaction: sql: database is closed" instance=foo project=default
time="2024-07-04T08:43:09Z" level=error msg="The stopns hook failed to load" err="Failed to begin transaction: sql: database is closed"
time="2024-07-04T08:43:09Z" level=warning msg="Failed loading instance from database, trying backup file" err="Failed to begin transaction: sql: database is closed" instance=foo project=default
time="2024-07-04T08:43:09Z" level=error msg="The stop hook failed to load" err="Failed to begin transaction: sql: database is closed"
time="2024-07-04T08:43:37Z" level=warning msg="Failed shutting down instance, forcefully stopping" err="Failed shutting down instance, status is \"Running\": context deadline exceeded" instance=foo project=default
time="2024-07-04T08:43:37Z" level=info msg="Stopping instance" action=stop created="0001-01-01 00:00:00 +0000 UTC" ephemeral=false instance=foo instanceType=container project=default stateful=false used="0001-01-01 00:00:00 +0000 UTC"

LoadFromBackup was originally intended to work without a DB, however over time it
has been accidentally modified to depend on the DB again.

Now the onStop hook issue has been resolved that has highlighted this issue.

Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
And improved error messages.

Signed-off-by: Thomas Parrott <thomas.parrott@canonical.com>
@tomponline tomponline requested a review from MusicDin July 4, 2024 11:55
@tomponline
Copy link
Member Author

That fixed it!

Copy link
Member

@MusicDin MusicDin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! This looks good, especially having less DB queries.

Just curious whether it occur that the database contains different expanded config compared to what is in backup file? Though, I cannot think of any scenario

Copy link
Member

@simondeziel simondeziel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thank you!

@tomponline tomponline merged commit 88c2440 into canonical:main Jul 4, 2024
28 checks passed
@tomponline tomponline deleted the tp-liblxc-stop branch July 4, 2024 12:33
@tomponline
Copy link
Member Author

Just curious whether it occur that the database contains different expanded config compared to what is in backup file? Though, I cannot think of any scenario

Its possible, but unlikely, as the backup file is written from the DB at each start up or config change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants