host-ctr, host-containers: proper restarts #1230

Merged: 4 commits, Dec 11, 2020

Contributor

@etungsten commented Dec 3, 2020

I recommend reviewing each commit separately

Issue number:
Fixes #1229

Description of changes:

Author: Erikson Tung <etung@amazon.com>
Date:   Sun Dec 6 20:54:03 2020 -0800

    host-containers: add safeguards against lingering host containers
    
    Now that host-ctr has the ability to rebind to existing host containers,
    we want to ensure that whenever we enable a host container, the container
    will be running with its latest configuration.
    
    We utilize `host-ctr`'s clean-up command to clean up any potential
    lingering host-container when we're enabling a previously disabled
    host-container and whenever a host-container is disabled.
Author: Erikson Tung <etung@amazon.com>
Date:   Fri Dec 4 20:02:37 2020 -0800

    host-ctr: add new subcommand `clean-up`
    
    Adds a new subcommand `clean-up` that checks whether a given container
    exists. If it does, `host-ctr` will attempt to kill the container task
    and delete the container.
Author: Erikson Tung <etung@amazon.com>
Date:   Fri Dec 4 18:34:07 2020 -0800

    host-ctr: refactoring
    
    Refactors `host-ctr`.
    Categorizes functionality into subcommands.

Author: Erikson Tung <etung@amazon.com>
Date:   Fri Dec 4 14:31:26 2020 -0800

    host-containers@: remove KillMode=mixed
    
    We don't need systemd to actively try to kill all processes
    in the unit's cgroup.

Author: Erikson Tung <etung@amazon.com>
Date:   Wed Dec 2 18:18:57 2020 -0800

    host-ctr: do not kill existing container, take over it
    
    If the host-container already exists, we should just take over the
    helm and not try to replace it with a new container. This is so that
    even if we temporarily lose connection with host-containerd, we can
    still eventually get the task status when containerd comes back up.

commit 0dbc03cdcde435aa259f34d80f6ec66a202b45a4
Author: Erikson Tung <etung@amazon.com>
Date:   Wed Dec 2 14:45:44 2020 -0800

    host-containers: 'Wants' host-containerd instead of 'BindsTo'
    
    host-containers@ systemd units should not stop when host-containerd is
    restarted or killed.
    
    By changing host-containers' dependency on host-containerd.service from
    `BindsTo=` to `Wants=`, we ensure host container tasks won't be killed if
    host-containerd temporarily stops.
    
    host-ctr then has a chance to reclaim the container task when
    host-containerd comes back up.

Testing done:

  • Built AMI, launched instance
  • sudo sheltie into the host via the admin container
  • Restarted host-containerd, and my ssh connection to the admin container was still alive
  • Checked the status of host-containers@admin and saw that it exited and restarted successfully.

host-containers@admin initially starts successfully.

Dec 03 03:09:30 host-ctr[3077]: Server listening on 0.0.0.0 port 22.
Dec 03 03:09:30 host-ctr[3077]: Server listening on :: port 22.
Dec 03 03:10:48 host-ctr[3077]: Accepted publickey for ec2-user from 123.123.123.123 port 5417 ssh2

This is where I restarted host-containerd. host-ctr loses connection to the containerd server and exits.

Dec 03 03:11:28 host-ctr[3077]: time="2020-12-03T03:11:28Z" level=error msg="failed to get container task exit status" error="rpc error: code = Unavailable desc = transport is closing"
Dec 03 03:11:28 host-ctr[3077]: time="2020-12-03T03:11:28Z" level=error msg="failed to delete container task" error="task must be stopped before deletion: running: failed precondition"
Dec 03 03:11:28 host-ctr[3077]: time="2020-12-03T03:11:28Z" level=error msg="failed to cleanup container" error="cannot delete running task admin: failed precondition"
Dec 03 03:11:28 systemd[1]: host-containers@admin.service: Main process exited, code=exited, status=1/FAILURE
Dec 03 03:11:28 systemd[1]: host-containers@admin.service: Failed with result 'exit-code'.

host-containers@admin restarts and host-ctr successfully rebinds to the admin container task that's already running.

Dec 03 03:12:14 systemd[1]: host-containers@admin.service: Scheduled restart job, restart counter is at 1.
Dec 03 03:12:14 systemd[1]: Stopped Host container: admin.
Dec 03 03:12:14 systemd[1]: Starting Host container: admin...
Dec 03 03:12:14 systemd[1]: Started Host container: admin.
Dec 03 03:12:14 host-ctr[5096]: time="2020-12-03T03:12:14Z" level=info msg="Pulling with Amazon ECR Resolver" ref="ecr.aws/arn:aws:ecr:us-west-2:328549459982:repository/bottlerocket-admin:v0.5.2"
Dec 03 03:12:14 host-ctr[5096]: time="2020-12-03T03:12:14Z" level=info msg="Pulled successfully" img="ecr.aws/arn:aws:ecr:us-west-2:328549459982:repository/bottlerocket-admin:v0.5.2"
Dec 03 03:12:14 host-ctr[5096]: time="2020-12-03T03:12:14Z" level=info msg=Unpacking... img="ecr.aws/arn:aws:ecr:us-west-2:328549459982:repository/bottlerocket-admin:v0.5.2"
Dec 03 03:12:14 host-ctr[5096]: time="2020-12-03T03:12:14Z" level=info msg="Tagging image" imageName="328549459982.dkr.ecr.us-west-2.amazonaws.com/bottlerocket-admin:v0.5.2"
Dec 03 03:12:14 host-ctr[5096]: time="2020-12-03T03:12:14Z" level=info msg="Container task is still running, proceeding to monitor it"

Testing for when both host-containers and host-containerd are restarted due to a single API transaction. We ensure the restarted host-ctr runs an up-to-date host container that reflects the setting changes:
See #1230 (comment)

Terms of contribution:

By submitting this pull request, I agree that this contribution is dual-licensed under the terms of both the Apache License, version 2.0, and the MIT license.

Contributor

@zmrow left a comment

Just a spelling nit.

☃️

@etungsten
Contributor Author

Push above adds a condition during clean up to not delete the task and container if we're returning due to host-containerd closing its connection. The deletions will fail anyways so we should not even attempt deletion.

@etungsten
Contributor Author

Push above reverts the previous force push. I realized that the change wasn't working in the way I expected it to.

Also addresses the typo pointed out in @zmrow's comment.

Contributor

@webern left a comment

Looks good. One probing question.

@samuelkarp
Contributor

@etungsten wrote

I was trying to follow the golang style guide regarding error strings golang/go/wiki/CodeReviewComments#error-strings.

It also explicitly says that this doesn't apply to normal log messages. So I kept info logs the same.

(it won't let me reply inline for some reason)

This guidance applies to the error type, not to the log output (though it's intended to improve log output when errors are appended onto log lines for context). I don't have any problem with the changes you made, but I did want to clarify that.

Contributor

@samuelkarp left a comment

Generally LGTM with one small requested change around adding another log line. With the introduction of this PR, it's about time for the next change to break up the _main function into smaller bits.

sources/host-ctr/cmd/host-ctr/main.go
Comment on lines +167 to +192
// Check if the target container already exists. If it does, take over the helm to manage it.
container, err := client.LoadContainer(ctx, containerID)
Contributor

Previously we handled the case where we modify the image for a host container that's currently running, since we'd always kill it and restart afterwards.

Now it looks like if we already have "admin" running, we'll continue using it even if the image has changed. I don't think that's the behavior we want.

Contributor

We should also handle toggling superpowered on and off.

Contributor Author

@etungsten Dec 4, 2020

If the image for the host container is changed via settings, then the corresponding host-containers@ service will be restarted by systemd via restart-commands = ["/usr/bin/host-containers"]. This means host-ctr will actually receive a termination signal and proceed to try to terminate the container task and restart. That hasn't changed.

What this is trying to address is host-ctr's connection to containerd being closed suddenly, host-ctr losing track of the container task status, and host-ctr then coming back up to reclaim the original running host container task.
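
The rebind behavior described above can be sketched as a small decision function. This is only an illustration under stated assumptions, not the code in `host-ctr`: the `step` type, its values, and `nextStep` are hypothetical names, and the branch for "container exists but its task is not running" is an assumption, since the discussion only covers the case where the task is still running.

```go
package main

import "fmt"

// step names what a restarted host-ctr would do next in this sketch.
// The type and its values are hypothetical, chosen for illustration.
type step string

const (
	createAndStart  step = "create container and start task"
	monitorExisting step = "monitor existing task"
	startNewTask    step = "start a new task in the existing container"
)

// nextStep picks a step from what was found in the container runtime.
// Assumption: an existing container without a running task gets a new task.
func nextStep(containerExists, taskRunning bool) step {
	switch {
	case !containerExists:
		return createAndStart
	case taskRunning:
		// The container survived the lost connection; take over instead of replacing it.
		return monitorExisting
	default:
		return startNewTask
	}
}

func main() {
	fmt.Println(nextStep(false, false))
	fmt.Println(nextStep(true, true))
	fmt.Println(nextStep(true, false))
}
```

The middle case corresponds to the "Container task is still running, proceeding to monitor it" log line in the testing output.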

Contributor Author

@etungsten Dec 4, 2020

Oh I guess looking at the code, it seems like to actually apply the new host containers setting (whether it be toggling superpowered or changing the image URL) users would have to toggle the enable setting as well. host-containers (the binary) doesn't actually handle restarts?

Is that the intended workflow? @zmrow @tjkirch ?

Contributor

It doesn't seem robust to the case where host-ctr itself dies for some reason, gets restarted, and finds a running container - which may not be running with the right settings. Previously we did handle this correctly, at the cost of tearing down a running container that was otherwise fine.

Can we inspect the running container and determine that it's correct based on the image and one of the superpowered properties from the spec?

Contributor

You're right, changing the image or toggling superpowered doesn't restart the affected host container today. 😞

Contributor

There might be another edge case to consider - if settings for host-containerd are changed in the same transaction that disables a running host container, then host-ctr will exit after losing the connection, and not be restarted afterwards to clean up the running task.

One approach might be for host-ctr to have a cleanup mode so that it can be invoked by host-containers to delete the running task, if present.

That would handle the edge case and avoid the need to inspect the running container, because it wouldn't be running if the settings had changed.
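
A cleanup mode along these lines might look roughly like the sketch below. It is written against a hypothetical `containerRuntime` interface rather than the real containerd client, so every method name here is an assumption made for illustration; the in-memory `fakeRuntime` exists only so the example runs on its own. The fake refuses to delete a running task, mirroring the "task must be stopped before deletion" error seen in the logs.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound is a stand-in for a runtime's "not found" error.
var errNotFound = errors.New("not found")

// containerRuntime is a hypothetical, minimal view of a container runtime.
// It is NOT the containerd client API; it only names the steps involved.
type containerRuntime interface {
	ContainerExists(id string) (bool, error)
	KillTask(id string) error   // returns errNotFound if there is no task
	DeleteTask(id string) error // returns errNotFound if there is no task
	DeleteContainer(id string) error
}

// cleanUp removes a lingering container if one exists. It is idempotent:
// calling it for a container that is already gone is not an error.
func cleanUp(rt containerRuntime, id string) error {
	exists, err := rt.ContainerExists(id)
	if err != nil {
		return fmt.Errorf("failed to look up container %q: %w", id, err)
	}
	if !exists {
		return nil
	}
	if err := rt.KillTask(id); err != nil && !errors.Is(err, errNotFound) {
		return fmt.Errorf("failed to kill task %q: %w", id, err)
	}
	if err := rt.DeleteTask(id); err != nil && !errors.Is(err, errNotFound) {
		return fmt.Errorf("failed to delete task %q: %w", id, err)
	}
	if err := rt.DeleteContainer(id); err != nil {
		return fmt.Errorf("failed to delete container %q: %w", id, err)
	}
	return nil
}

// fakeRuntime is an in-memory runtime used only to exercise cleanUp.
type fakeRuntime struct {
	containers map[string]bool // container ID -> exists
	tasks      map[string]bool // container ID -> task is running
}

func (f *fakeRuntime) ContainerExists(id string) (bool, error) {
	return f.containers[id], nil
}

func (f *fakeRuntime) KillTask(id string) error {
	if _, ok := f.tasks[id]; !ok {
		return errNotFound
	}
	f.tasks[id] = false // the task is now stopped
	return nil
}

func (f *fakeRuntime) DeleteTask(id string) error {
	running, ok := f.tasks[id]
	if !ok {
		return errNotFound
	}
	if running {
		return errors.New("task must be stopped before deletion")
	}
	delete(f.tasks, id)
	return nil
}

func (f *fakeRuntime) DeleteContainer(id string) error {
	delete(f.containers, id)
	return nil
}

func main() {
	rt := &fakeRuntime{
		containers: map[string]bool{"admin": true, "control": true},
		tasks:      map[string]bool{"admin": true, "control": true},
	}
	if err := cleanUp(rt, "control"); err != nil {
		fmt.Println("clean-up failed:", err)
		return
	}
	fmt.Println("control exists:", rt.containers["control"])
	fmt.Println("admin exists:", rt.containers["admin"])
}
```

Treating "not found" as success keeps the operation safe to invoke unconditionally, which matters if the caller cannot tell whether a task is actually lingering.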

Contributor

> Is that the intended workflow? @zmrow @tjkirch ?

(Just for historical perspective - it was intended, but we didn't like it :) We didn't have the tools at the time to restart properly on those settings changes and considered it a weakness.)

@etungsten changed the title from "host-ctr: proper restarts" to "host-ctr, host-containers: proper restarts" on Dec 7, 2020
@etungsten
Contributor Author

Push above adds additional commits to address concerns about edge cases with host-ctr potentially rebinding to an out-of-date host container when both host-containers and host-containerd are restarted.

  • A new clean-up mode has been added to host-ctr
  • Refactored host-ctr to accommodate the new subcommand.
  • host-containers (the rust binary) will now call host-ctr clean-up when appropriate to ensure restarted host-containers are running with the latest settings/configuration.
  • Removes KillMode=mixed from host-containers@ units
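
The "when appropriate" condition for calling the clean-up command, as the commit message states it (when enabling a previously disabled host container, and whenever a host container is disabled), reduces to a small predicate. The sketch below is in Go for consistency with `host-ctr`, even though `host-containers` is a Rust binary. `needsCleanUp`, the `transition` type, and the `extra` container name are hypothetical, and how the previous enabled state is determined is left out as an assumption.

```go
package main

import "fmt"

// needsCleanUp reports whether a lingering container should be cleaned up
// before the unit's state is changed. Hypothetical helper for illustration.
func needsCleanUp(wasEnabled, wantEnabled bool) bool {
	if !wantEnabled {
		// Whenever a host container is disabled, clean up.
		return true
	}
	// When enabling, clean up only if it was previously disabled.
	return !wasEnabled
}

// transition describes one host container's previous and desired state.
type transition struct {
	name        string
	wasEnabled  bool
	wantEnabled bool
}

func main() {
	transitions := []transition{
		{name: "admin", wasEnabled: true, wantEnabled: true},
		{name: "control", wasEnabled: true, wantEnabled: false},
		{name: "extra", wasEnabled: false, wantEnabled: true}, // hypothetical container
	}
	for _, t := range transitions {
		fmt.Printf("%s: clean up first = %v\n", t.name, needsCleanUp(t.wasEnabled, t.wantEnabled))
	}
}
```

A container that was enabled and stays enabled is the only case left alone, so a healthy running container is not torn down.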

@etungsten
Contributor Author

etungsten commented Dec 7, 2020

Testing done for verifying the fix for the edge case:

host-ctr no longer rebinds to lingering out-of-date host-containers when both host-containers and host-containerd are restarted.

bash-5.0# 
bash-5.0# ############# Restart host-containerd
bash-5.0# 
bash-5.0# systemctl restart host-containerd
bash-5.0# sleep 1
bash-5.0# 
bash-5.0# 
bash-5.0# ############# host-ctr loses connection and there is now a lingering control host container running
bash-5.0# 
bash-5.0# systemctl status host-containers@control
● host-containers@control.service - Host container: control
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/host-containers@.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Mon 2020-12-07 06:35:21 UTC; 992ms ago
    Process: 3020 ExecStartPre=/usr/bin/mkdir -m 1777 -p ${LOCAL_DIR}/host-containers/control (code=exited, status=0/SUCCESS)
    Process: 3080 ExecStart=/usr/bin/host-ctr run -container-id=control -source=${CTR_SOURCE} -superpowered=${CTR_SUPERPOWERED} (code=exited, status=1/FAILURE)
   Main PID: 3080 (code=exited, status=1/FAILURE)

Dec 07 06:35:21 ip-192-168-28-193.us-west-2.compute.internal host-ctr[3080]: time="2020-12-07T06:35:21Z" level=error msg="failed to cleanup container" error="cannot delete running task control: failed precondition"
Dec 07 06:35:21 ip-192-168-28-193.us-west-2.compute.internal systemd[1]: host-containers@control.service: Main process exited, code=exited, status=1/FAILURE
Dec 07 06:35:21 ip-192-168-28-193.us-west-2.compute.internal systemd[1]: host-containers@control.service: Failed with result 'exit-code'.
bash-5.0# 
bash-5.0# ctr -a /run/host-containerd/containerd.sock task ls
TASK       PID     STATUS    
control    3739    RUNNING
admin      3823    RUNNING
bash-5.0# 
bash-5.0# ctr -a /run/host-containerd/containerd.sock container ls
CONTAINER    IMAGE                                                                                RUNTIME                  
admin        ecr.aws/arn:aws:ecr:us-west-2:328549459982:repository/bottlerocket-admin:v0.5.2      io.containerd.runc.v2    
control      ecr.aws/arn:aws:ecr:us-west-2:328549459982:repository/bottlerocket-control:v0.4.1    io.containerd.runc.v2    
bash-5.0# 
bash-5.0# 
bash-5.0# 
bash-5.0# ############# Disable host-containers@control through settings
bash-5.0# 
bash-5.0# apiclient -u /settings -m PATCH -d '{"host-containers": {"control": {"enabled": false}}}'
bash-5.0# apiclient -u /tx/commit_and_apply -m POST
["settings.host-containers.control.enabled"]
bash-5.0# sleep 3

bash-5.0# 
bash-5.0# 
bash-5.0# 
bash-5.0# ############# Restart commands succeeded 
bash-5.0# 
bash-5.0# journalctl -u apiserver -n 50
-- Logs begin at Mon 2020-12-07 06:30:11 UTC, end at Mon 2020-12-07 06:35:22 UTC. --
Dec 07 06:30:13 localhost systemd[1]: Starting Bottlerocket API server...
Dec 07 06:30:13 localhost apiserver[2487]: 06:30:13 [INFO] Starting server at /run/api.sock with 1 thread and datastore at /var/lib/bottlerocket/datastore/current
Dec 07 06:30:13 localhost systemd[1]: Started Bottlerocket API server.
Dec 07 06:30:13 localhost apiserver[2487]: 06:30:13 [INFO] Starting 1 workers
Dec 07 06:30:13 localhost apiserver[2487]: 06:30:13 [INFO] Starting "actix-web-service-"/run/api.sock"" service on "/run/api.sock" (pathname)
Dec 07 06:35:22 apiserver[5871]: 06:35:22 [INFO] thar-be-settings started
Dec 07 06:35:22 apiserver[5871]: 06:35:22 [INFO] Parsing stdin for updated settings
Dec 07 06:35:22 apiserver[5871]: 06:35:22 [INFO] Requesting affected services for settings: {"settings.host-containers.control.enabled"}
Dec 07 06:35:22 apiserver[5871]: 06:35:22 [INFO] Restarting affected services...
bash-5.0# sleep 5
bash-5.0# 
bash-5.0# 
bash-5.0# ############## host-containers@control is disabled at this point
bash-5.0# 
bash-5.0# systemctl status host-containers@control
● host-containers@control.service - Host container: control
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/host-containers@.service; disabled; vendor preset: enabled)
     Active: inactive (dead)

....
Dec 07 06:35:21 host-ctr[3080]: time="2020-12-07T06:35:21Z" level=error msg="failed to get container task exit status" error="rpc error: code = Unavailable desc = transport is closing"
...
Dec 07 06:35:21 host-ctr[3080]: time="2020-12-07T06:35:21Z" level=error msg="failed to cleanup container" error="cannot delete running task control: failed precondition"
Dec 07 06:35:21 systemd[1]: host-containers@control.service: Main process exited, code=exited, status=1/FAILURE
Dec 07 06:35:21 systemd[1]: host-containers@control.service: Failed with result 'exit-code'.
Dec 07 06:35:22 systemd[1]: Stopped Host container: control.
bash-5.0# 
bash-5.0# 
bash-5.0# 
bash-5.0# ############## The lingering control host container got cleaned up by the restart command
bash-5.0# 
bash-5.0# ctr -a /run/host-containerd/containerd.sock task ls
TASK     PID     STATUS    
admin    3823    RUNNING
bash-5.0# 
bash-5.0# ctr -a /run/host-containerd/containerd.sock container ls
CONTAINER    IMAGE                                                                              RUNTIME                  
admin        ecr.aws/arn:aws:ecr:us-west-2:328549459982:repository/bottlerocket-admin:v0.5.2    io.containerd.runc.v2    
bash-5.0# 
bash-5.0# 
bash-5.0# ############## Re-enable control host-container and use a different image source for differentiation
bash-5.0# 
<459982.dkr.ecr.us-west-2.amazonaws.com/bottlerocket-control:v0.4.0"}}}'
bash-5.0# apiclient -u /tx/commit_and_apply -m POST
["settings.host-containers.control.enabled","settings.host-containers.control.source"]
bash-5.0# sleep 5
bash-5.0# 
bash-5.0# 
bash-5.0# 
bash-5.0# ############# Restart commands succeeded 
bash-5.0# 
bash-5.0# journalctl -u apiserver -n 50
-- Logs begin at Mon 2020-12-07 06:30:11 UTC, end at Mon 2020-12-07 06:35:35 UTC. --
Dec 07 06:30:13 localhost systemd[1]: Starting Bottlerocket API server...
Dec 07 06:30:13 localhost apiserver[2487]: 06:30:13 [INFO] Starting server at /run/api.sock with 1 thread and datastore at /var/lib/bottlerocket/datastore/current
Dec 07 06:30:13 localhost systemd[1]: Started Bottlerocket API server.
Dec 07 06:30:13 localhost apiserver[2487]: 06:30:13 [INFO] Starting 1 workers
Dec 07 06:30:13 localhost apiserver[2487]: 06:30:13 [INFO] Starting "actix-web-service-"/run/api.sock"" service on "/run/api.sock" (pathname)
Dec 07 06:35:22 apiserver[5871]: 06:35:22 [INFO] thar-be-settings started
Dec 07 06:35:22 apiserver[5871]: 06:35:22 [INFO] Parsing stdin for updated settings
Dec 07 06:35:22 apiserver[5871]: 06:35:22 [INFO] Requesting affected services for settings: {"settings.host-containers.control.enabled"}
Dec 07 06:35:22 apiserver[5871]: 06:35:22 [INFO] Restarting affected services...
Dec 07 06:35:30 apiserver[6037]: 06:35:30 [INFO] thar-be-settings started
Dec 07 06:35:30 apiserver[6037]: 06:35:30 [INFO] Parsing stdin for updated settings
Dec 07 06:35:30 apiserver[6037]: 06:35:30 [INFO] Requesting affected services for settings: {"settings.host-containers.control.source", "settings.host-containers.control.enabled"}
Dec 07 06:35:30 apiserver[6037]: 06:35:30 [INFO] Restarting affected services...
bash-5.0# sleep 10
bash-5.0# 
bash-5.0# 
bash-5.0# 
bash-5.0# ############# control host container running again
bash-5.0# systemctl status host-containers@control
● host-containers@control.service - Host container: control
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/host-containers@.service; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2020-12-07 06:35:30 UTC; 14s ago
    Process: 6070 ExecStartPre=/usr/bin/mkdir -m 1777 -p ${LOCAL_DIR}/host-containers/control (code=exited, status=0/SUCCESS)
   Main PID: 6088 (host-ctr)
      Tasks: 19 (limit: 9185)
     Memory: 42.5M
     CGroup: /system.slice/system-host\x2dcontainers.slice/host-containers@control.service
             └─6088 /usr/bin/host-ctr run -container-id=control -source=328549459982.dkr.ecr.us-west-2.amazonaws.com/bottlerocket-control:v0.4.0 -superpowered=false

.....
bash-5.0# 
bash-5.0# 
bash-5.0# 
bash-5.0# ############# Notice that the new control host container is using the new image source.
bash-5.0# 
bash-5.0# ctr -a /run/host-containerd/containerd.sock task ls
TASK       PID     STATUS    
admin      3823    RUNNING
control    6209    RUNNING
bash-5.0# 
bash-5.0# ctr -a /run/host-containerd/containerd.sock container ls
CONTAINER    IMAGE                                                                                RUNTIME                  
admin        ecr.aws/arn:aws:ecr:us-west-2:328549459982:repository/bottlerocket-admin:v0.5.2      io.containerd.runc.v2    
control      ecr.aws/arn:aws:ecr:us-west-2:328549459982:repository/bottlerocket-control:v0.4.0    io.containerd.runc.v2    
bash-5.0# 
bash-5.0# #

@etungsten
Contributor Author

etungsten commented Dec 7, 2020

I noticed while making these new changes that host-containers, the Rust binary responsible for managing host container systemd units, unconditionally loops through all host containers to apply their current settings and tries to enable or disable each of their systemd units. This means that I can't unconditionally try to clean up host containers with host-ctr before host-containers attempts to enable any particular host-container unit, because it'll do it for every host container regardless of its current running state and whether its settings have changed or not.

host-containers (the binary) currently does not have the ability to detect settings changes and to only enact them upon impacted host containers. We can however implement this by making host-containers check whether the environment file it's writing to for each container is being changed. If the environment file is changed, that means the host container that's being handled right now has to be restarted if it's currently running (enabled=true). If the environment file hasn't changed, then that means we don't need to do anything for the host container being handled. If the environment file does not exist, that means we're going through first boot and we should apply the settings and enact them unconditionally.

I did not try to implement this here because it's not strictly within the scope of this PR. But I can make an issue to follow up with this if people think the approach above is sensible. Please let me know what you think @tjkirch, @bcressey .
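
The environment-file check proposed above could be sketched as follows. This is a rough illustration in Go (the real `host-containers` binary is Rust), and all names in it are hypothetical: `action`, `decide`, and `decideForFile` do not exist in the project. The proposal does not say what to do when the file changed but the container is disabled; the sketch assumes the settings are simply applied with nothing to restart. The image reference and file path in `main` are placeholders.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// action is what the sketch decides to do for one host container.
type action string

const (
	noAction         action = "nothing to do"
	applySettings    action = "apply settings"
	restartContainer action = "apply settings and restart container"
)

// decide compares the current environment file contents with the desired ones.
//   - no file yet:        first boot, apply unconditionally
//   - unchanged:          nothing to do
//   - changed + enabled:  the running container must be restarted
//   - changed + disabled: assumption, apply settings only
func decide(current []byte, exists bool, desired []byte, enabled bool) action {
	switch {
	case !exists:
		return applySettings
	case bytes.Equal(current, desired):
		return noAction
	case enabled:
		return restartContainer
	default:
		return applySettings
	}
}

// decideForFile reads envPath (if present) and delegates to decide.
func decideForFile(envPath string, desired []byte, enabled bool) (action, error) {
	current, err := os.ReadFile(envPath)
	if errors.Is(err, fs.ErrNotExist) {
		return decide(nil, false, desired, enabled), nil
	}
	if err != nil {
		return noAction, fmt.Errorf("failed to read %s: %w", envPath, err)
	}
	return decide(current, true, desired, enabled), nil
}

func main() {
	// Placeholder contents; CTR_SOURCE and CTR_SUPERPOWERED are the variables
	// referenced by the unit's ExecStart line.
	desired := []byte("CTR_SOURCE=example.com/control:v2\nCTR_SUPERPOWERED=false\n")

	// Placeholder path that is expected not to exist, so this takes the
	// first-boot branch.
	a, err := decideForFile("/nonexistent/control.env", desired, true)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(a)
}
```

Keeping the comparison in a pure function separate from the file read makes the four cases easy to check without touching the filesystem.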

@etungsten etungsten requested review from bcressey and tjkirch December 7, 2020 18:36
@tjkirch
Contributor

tjkirch commented Dec 7, 2020

^ I think it could be better to improve the way restart-commands are run so that they're always given the list of settings that have changed, rather than the setting name having to be in the restart-command itself. That way we can handle dynamic settings, and not have to inspect the system to see what changed.

For example, right now, services.motd could have a restart-command like my-command settings.motd because we know services.motd is only associated with one setting, but services.host-containers is associated with any host container name under settings.host-containers, so we can't pass (hardcode) a single setting. Instead, if the command was given the changed settings, it could alter its behavior based on what changed, like in this case where we need to know which host container changed. (It could also give us the ability to re-use restart command helpers more often, if they can branch on setting.)

(I just mean this as a different potential follow-up, not as a blocker for this PR)
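
If a restart command were handed the list of changed settings, as suggested above, working out which host containers are affected could look something like this. It is a sketch only: `changedHostContainers` is a hypothetical helper, and the assumption that keys have the shape `settings.host-containers.<name>.<field>` is taken from the keys printed by `apiclient` in the testing output.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// hostContainersPrefix is the key prefix seen in the apiclient output.
const hostContainersPrefix = "settings.host-containers."

// changedHostContainers returns the sorted, de-duplicated names of the host
// containers referenced by the given changed setting keys. Keys outside the
// host-containers namespace are ignored.
func changedHostContainers(changed []string) []string {
	seen := map[string]bool{}
	for _, key := range changed {
		if !strings.HasPrefix(key, hostContainersPrefix) {
			continue
		}
		rest := strings.TrimPrefix(key, hostContainersPrefix)
		name := rest
		if i := strings.Index(rest, "."); i >= 0 {
			name = rest[:i]
		}
		if name != "" {
			seen[name] = true
		}
	}
	names := make([]string, 0, len(seen))
	for name := range seen {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	changed := []string{
		"settings.host-containers.control.enabled",
		"settings.host-containers.control.source",
		"settings.motd",
	}
	fmt.Println(changedHostContainers(changed))
}
```

With that list in hand, a helper could act only on the containers that changed instead of looping over all of them.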

Contributor

@samuelkarp left a comment

Can we separate the refactoring change from the rest of this PR?

@etungsten
Contributor Author

> Can we separate the refactoring change from the rest of this PR?

I'm inclined to keep it as part of this PR since we're adding an additional "subcommand-like" functionality to host-ctr here. It feels like the right thing to do as opposed to having to add another flag that completely changes what host-ctr does. I very much regret adding -pull-image-only as a flag instead of a subcommand.

Hopefully the refactoring changes aren't too controversial and don't produce too much churn. It's mostly refactoring stuff into functions.

@samuelkarp
Contributor

> I'm inclined to keep it as part of this PR since we're adding an additional "subcommand-like" functionality to host-ctr here. It feels like the right thing to do as opposed to having to add another flag that completely changes what host-ctr does. I very much regret adding -pull-image-only as a flag instead of a subcommand.
>
> Hopefully the refactoring changes aren't too controversial and don't produce too much churn. It's mostly refactoring stuff into functions.

It might be easier to pull it out into a separate PR that we merge ahead of this one. I don't expect breaking into functions to be controversial, but I'm expecting that I won't be the only one with an opinion on the subcommand implementation and it might reduce churn/rebasing effort for you to do it that way. And separating the functional changes (proper restart behavior) from the refactor should make both easier to review.

@etungsten etungsten marked this pull request as draft December 9, 2020 00:50
@etungsten
Contributor Author

This PR now depends on #1235 being merged before proceeding. Will rebase once it does.

@etungsten etungsten mentioned this pull request Dec 9, 2020
@etungsten
Contributor Author

etungsten commented Dec 9, 2020

Push above rebases upon develop to pull in the refactor for host-ctr.

Retested, and everything still works as expected; the previous test results still stand.

@etungsten etungsten marked this pull request as ready for review December 9, 2020 20:31
@etungsten etungsten requested a review from samuelkarp December 9, 2020 20:32
Contributor

@zmrow left a comment

🎖️

@etungsten
Contributor Author

Push above drops the commit for removing KillMode=mixed from the host containers unit files.

Development

Successfully merging this pull request may close these issues.

host-containers get restarted when host-containerd restarts
6 participants