host-ctr, host-containers: proper restarts #1230
Just a spelling nit.
☃️
Force-pushed from 8ef22e3 to c00eb70.
Push above adds a condition during cleanup to not delete the task and container if we're returning because host-containerd closed its connection. The deletions would fail anyway, so we should not even attempt them.
Force-pushed from c00eb70 to 7fa1683.
Push above reverts the previous force push. I realized that the change wasn't working the way I expected it to. Also addresses the typo pointed out by @zmrow's comment.
Looks good. One probing question.
@etungsten wrote:
(it won't let me reply inline for some reason) This guidance applies to the
Generally LGTM, with one small requested change around adding another log line. With the introduction of this PR, it's about time for the next change to break up the `_main` function into smaller bits.
Force-pushed from 7fa1683 to 43ad5dd.
```go
// Check if the target container already exists. If it does, take over the helm to manage it.
container, err := client.LoadContainer(ctx, containerID)
```
Previously we handled the case where we modify the image for a host container that's currently running, since we'd always kill it and restart afterwards.
Now it looks like if we already have "admin" running, we'll continue using it even if the image has changed. I don't think that's the behavior we want.
We should also handle toggling superpowered on and off.
If the image for the host container is changed via settings, then the corresponding `host-containers@` service will be restarted by systemd via `restart-commands = ["/usr/bin/host-containers"]`. This means `host-ctr` will actually receive a termination signal and proceed to terminate the container task and restart. That hasn't changed.
What this is trying to address is `host-ctr`'s connection to containerd being closed off suddenly, losing track of the container task status, and coming back up to try and reclaim the original running host container task.
Oh, I guess looking at the code, it seems like to actually apply the new host-containers settings (whether toggling superpowered or changing the image URL), users would have to toggle the `enable` setting as well. `host-containers` (the binary) doesn't actually handle restarts?
It doesn't seem robust to the case where `host-ctr` itself dies for some reason, gets restarted, and finds a running container - which may not be running with the right settings. Previously we did handle this correctly, at the cost of tearing down a running container that was otherwise fine.
Can we inspect the running container and determine that it's correct based on the image and one of the superpowered properties from the spec?
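For reference, a minimal sketch (assuming the containerd Go client) of how such a check might look; the helper name and the use of CAP_SYS_ADMIN in the bounding set as the "superpowered" marker are hypothetical illustrations, not host-ctr's actual code.

```go
package sketch

import (
	"context"

	"github.com/containerd/containerd"
)

// containerMatchesConfig (hypothetical helper) checks whether an existing
// container still matches the desired image and "superpowered" setting
// before host-ctr rebinds to it.
func containerMatchesConfig(ctx context.Context, container containerd.Container, wantImage string, superpowered bool) (bool, error) {
	img, err := container.Image(ctx)
	if err != nil {
		return false, err
	}
	if img.Name() != wantImage {
		// The image reference changed via settings; the caller should tear
		// down the container and recreate it.
		return false, nil
	}

	spec, err := container.Spec(ctx)
	if err != nil {
		return false, err
	}
	// Assumed marker: treat CAP_SYS_ADMIN in the bounding capability set as
	// the indicator that the container was created "superpowered".
	hasSysAdmin := false
	if spec.Process != nil && spec.Process.Capabilities != nil {
		for _, c := range spec.Process.Capabilities.Bounding {
			if c == "CAP_SYS_ADMIN" {
				hasSysAdmin = true
				break
			}
		}
	}
	return hasSysAdmin == superpowered, nil
}
```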
You're right, changing the image or toggling superpowered doesn't restart the affected host container today. 😞
There might be another edge case to consider - if settings for host-containerd are changed in the same transaction that disables a running host container, then `host-ctr` will exit after losing the connection, and not be restarted afterwards to clean up the running task.
One approach might be for `host-ctr` to have a cleanup mode so that it can be invoked by `host-containers` to delete the running task, if present.
That would handle the edge case and avoid the need to inspect the running container, because it wouldn't be running if the settings had changed.
Push above adds additional commits to address concerns about edge cases with host-ctr potentially rebinding to an out-of-date host container when both host-containers and host-containerd are restarted.
Testing done for verifying the fix for the edge case:
I noticed while making these new changes that
I did not try to implement this here because it's not strictly within the scope of this PR. But I can make an issue to follow up with this if people think the approach above is sensible. Please let me know what you think @tjkirch, @bcressey .
^ I think it could be better to improve the way restart-commands are run so that they're always given the list of settings that have changed, rather than the setting name having to be in the restart-command itself. That way we can handle dynamic settings, and not have to inspect the system to see what changed. (I just mean this as a different potential follow-up, not as a blocker for this PR.)
Can we separate the refactoring change from the rest of this PR?
I'm inclined to keep it as part of this PR since we're adding an additional "subcommand-like" functionality to `host-ctr`. Hopefully the refactoring changes aren't too controversial and don't produce too much churn. It's mostly refactoring stuff into functions.
It might be easier to pull it out into a separate PR that we merge ahead of this one. I don't expect breaking into functions to be controversial, but I'm expecting that I won't be the only one with an opinion on the subcommand implementation and it might reduce churn/rebasing effort for you to do it that way. And separating the functional changes (proper restart behavior) from the refactor should make both easier to review.
This PR now depends on #1235 being merged before proceeding. Will rebase once it does.
host-containers@ systemd units should not stop when host-containerd is restarted or killed. By changing host-containers' dependency on host-containerd.service from `BindsTo=` to `Wants=`, we ensure host container tasks won't be killed if host-containerd temporarily stops. host-ctr then has a chance to reclaim the container task when host-containerd comes back up.
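For illustration, the dependency change might look something like the following fragment of a `host-containers@.service` unit; the actual unit in Bottlerocket contains more than what is shown here.

```ini
[Unit]
# Previously, BindsTo=host-containerd.service stopped this unit whenever
# host-containerd stopped, taking the host container task down with it.
# The weaker Wants= dependency keeps the unit (and host-ctr) alive across a
# temporary host-containerd stop, so the container task can be reclaimed.
Wants=host-containerd.service
```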
If the host-container already exists, we should just take over the helm and not try to replace it with a new container. This is so that even if we temporarily lose connection with host-containerd, we can still eventually get the task status back when containerd comes back up.
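A sketch of what this "take over if it already exists" idea could look like with the containerd Go client; the function name and the omission of creation options are illustrative, not host-ctr's actual implementation.

```go
package sketch

import (
	"context"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/errdefs"
)

// loadOrCreate (hypothetical helper) rebinds to an existing container when
// possible instead of replacing it.
func loadOrCreate(ctx context.Context, client *containerd.Client, containerID string) (containerd.Container, error) {
	// If a container with this ID already exists, take over managing it so a
	// brief host-containerd outage doesn't cost us the running task.
	if existing, err := client.LoadContainer(ctx, containerID); err == nil {
		return existing, nil
	} else if !errdefs.IsNotFound(err) {
		return nil, err
	}
	// Not found: create it as before (image, spec, and snapshot options elided).
	return client.NewContainer(ctx, containerID)
}
```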
Force-pushed from 81906dd to 69c8fce.
Push above rebases on develop to pull in the refactor for host-ctr. Tested things and they still work as expected; the previous test results still stand.
🎖️
Adds a new subcommand `clean-up` that checks whether a given container exists; if it does, `host-ctr` will attempt to kill the container task and delete the container.
Now that host-ctr has the ability to rebind to existing host containers, we want to ensure that whenever we enable a host container, it will be running with its latest configuration. We utilize `host-ctr`'s clean-up command to clean up any potentially lingering host container when we're enabling a previously disabled host container, and whenever a host container is disabled.
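A rough sketch, under stated assumptions, of what the clean-up flow described above might look like with the containerd Go client; the function name, signal choice, and error handling are illustrative rather than host-ctr's actual code.

```go
package sketch

import (
	"context"
	"syscall"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/errdefs"
)

// cleanUp (hypothetical helper) removes a lingering container and its task,
// if either exists.
func cleanUp(ctx context.Context, client *containerd.Client, containerID string) error {
	container, err := client.LoadContainer(ctx, containerID)
	if err != nil {
		if errdefs.IsNotFound(err) {
			// Nothing lingering; nothing to do.
			return nil
		}
		return err
	}

	// If a task is still present, kill it and wait for it to exit before
	// deleting it.
	if task, terr := container.Task(ctx, nil); terr == nil {
		if exitCh, werr := task.Wait(ctx); werr == nil {
			_ = task.Kill(ctx, syscall.SIGKILL)
			<-exitCh
		}
		_, _ = task.Delete(ctx)
	} else if !errdefs.IsNotFound(terr) {
		return terr
	}

	// Finally, remove the container and its snapshot.
	return container.Delete(ctx, containerd.WithSnapshotCleanup)
}
```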
Force-pushed from 69c8fce to d0781db.
Push above drops the commit for removing
I recommend reviewing each commit separately
Issue number:
Fixes #1229
Description of changes:
Testing done:
- `sudo sheltie` into the host via the admin container.
- Saw that `host-containers@admin` exited and restarted successfully.
- `host-containers@admin` initially starts successfully. This is where I restarted `host-containerd`.
- `host-ctr` loses connection to the containerd server and exits.
- `host-containers@admin` restarts and `host-ctr` successfully rebinds to the admin container task that's already running.

Testing for when both host-containers and host-containerd are restarted due to a single API transaction. We ensure the restarted host-ctr runs an up-to-date host container that reflects the setting changes:
See #1230 (comment)
Terms of contribution:
By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.