Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix loading CSI driver container from state if it exists #3970

Merged
merged 1 commit into from
Oct 24, 2023

Conversation

mye956
Copy link
Contributor

@mye956 mye956 commented Oct 17, 2023

Summary

This PR will address the issue where agent won't be able to recognize an existing running CSI driver daemon task after an agent restart. Because of this, when we get a new EBS attachment payload, agent will start up a new CSI driver daemon task even if there's already an existing running CSI driver task. While all of the managed tasks are properly populated back into the task engine during the setup of agent start, the list of managed daemon task aren't. To fix this, we'll check if there's an existing non-stopped CSI driver daemon task right after we load all of the managed tasks from BoltDB on agent restart.

Implementation details

Implemented the following:

  • agent/api/task/task.go: New functionality called IsManagedDaemonTask where it checks if the task is a managed daemon if it satisfy the following:
    • Was internally made by agent
    • Not stopped (i.e. not terminal)
    • Has container of type ContainerManagedDaemon
    • If all of the above satisfy, then the task will be added to the map of daemon tasks within loadTask
  • agent/api/container/container.go:
    • New functionality called IsManagedDaemonContainer where it checks if a container is of type ContainerManagedDaemon. This will be called in IsManagedDaemonTask as part of the managed daemon task check.
    • New functionality called GetImageName where it gets the image name without the tags of a container
  • agent/api/container/containertype.go: Added ContainerManagedDaemon to the map of container types that's used for encoding and decoding ContainerType objects
  • agent/engine/data.go: Modified loadTask to check if any of the tasks are considered a managed daemon task. If a task is then we add it to the map of managed daemon tasks in our task engine.
  • agent/ebs/watcher.go: Added an extra condition that we should also be creating a new CSI driver daemon task if the existing CSI driver daemon task has stopped (i.e. terminal).

(Note: There should only be at most one of each managed daemon task running on one instance at a time; i.e. multiple managed daemon tasks can be running on one instance as long as they're not the same one)

Testing

Added new unit tests.

Manual testing:
Ran an EBS-backed task which then created the CSI driver daemon task. After a bit, I restarted agent with the CSI driver task still running and was able to confirm that agent was able to properly load back the existing CSI driver daemon task.

level=debug time=2023-10-18T17:11:51Z msg="Task is not internally made"
level=debug time=2023-10-18T17:11:51Z msg="Task is not a managed daemon task: arn:aws:ecs:us-west-2::task/evm-test-gamma/9bb5422fecda42088bcc3451a8d37dbb" module=data.go
level=debug time=2023-10-18T17:11:51Z msg="Container is running and is internal"
level=debug time=2023-10-18T17:11:51Z msg="Successfully set managed daemon task for: ebs-csi-driver" module=data.go
level=info time=2023-10-18T17:11:51Z msg="Cluster was successfully restored" cluster="evm-test-gamma"
level=debug time=2023-10-18T17:11:51Z msg="Loading pause container tarball:" image="/images/amazon-ecs-pause.tar"
level=debug time=2023-10-18T17:11:51Z msg="Inspecting container image: " image="amazon/amazon-ecs-pause:0.1.0"

Ran another EBS-backed task and agent didn't create a new CSI driver daemon task as well as reused the existing task.

[ec2-user@ip-172-31-35-197 amazon-ecs-agent]$ docker ps
CONTAINER ID   IMAGE                            COMMAND                  CREATED         STATUS                   PORTS     NAMES
953d6d9ddf96   mreferre/yelb-db:0.5             "docker-entrypoint.s…"   2 minutes ago   Up 2 minutes                       ecs-evm-db-6-db-fee385a6f78fe3a1af01
f09b5e9f3526   amazon/amazon-ecs-agent:latest   "/agent"                 3 minutes ago   Up 3 minutes (healthy)             ecs-agent
b3ede243b73a   ebs-csi-driver:latest            "/bin/ebs-csi-driver…"   6 minutes ago   Up 6 minutes                       ecs---ecs-managed-ebs-csi-driver-a48dcaeed3cfdfdafb01

New tests cover the changes:Yes

Description for the changelog

Fix local agent state for CSI driver daemon task

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@mye956 mye956 force-pushed the ebsCSILocalState branch 4 times, most recently from 7a2b7b1 to 979ff2e Compare October 18, 2023 18:02
@mye956 mye956 marked this pull request as ready for review October 18, 2023 18:47
@mye956 mye956 requested a review from a team as a code owner October 18, 2023 18:47
@mye956 mye956 changed the title [WIP] Fix loading CSI driver container from state if it exists Fix loading CSI driver container from state if it exists Oct 18, 2023

for _, c := range task.Containers {
containerStatus := c.GetKnownStatus()
if containerStatus.IsRunning() && c.IsInternal() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does the container need to be running for us to check if it's name matches the "managed daemon" names?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couldn't this be a race condition where the agent gets a new EBS task before the existing managed daemon task containers have gone to running?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good question, we don't want to run into a scenario where the task/container was stopped but we still add it to the list of manged daemon tasks. otherwise agent won't be able to create/start a new CSI driver task prior to staging EBS volumes since it'll think that there's already an existing active CSI driver task

Copy link
Contributor Author

@mye956 mye956 Oct 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couldn't this be a race condition where the agent gets a new EBS task before the existing managed daemon task containers have gone to running?

we should be taking care of this when handling an EBS attachment payload -> https://github.com/aws/amazon-ecs-agent/blob/dev/agent/ebs/watcher.go#L132
we first check that if there's already an existing CSI driver task then we shouldn't create another.

Thinking about it now, we should also include if there's a CSI driver task in the process of being created already when loading from the BoltDB after agent restarts

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One other thought about removing the IsRunning check: Managed Daemons have a container restart policy of on failure which implies that they may be running or stopped. We should only ever have one container per daemon so we can track its health regardless of its process status. We'll depend on the container restart policy to handle the daemon lifecycle and shouldn't launch redundant daemons.

@mye956
Copy link
Contributor Author

mye956 commented Oct 18, 2023

I think the concept here would break when future managed daemons are added. For this problem don't we specifically need to figure out if we have the "ebs-csi-driver" task running? Not any managed daemon task?

Hmm, I don't quite follow here. If we were to introduce a new managed daemon then it could be added to the set of managed daemon images and also check for that as well along with the CSI driver image. You're completely right that the goal for this PR is to check if there's already an CSI driver task. Was also thinking of long term as well when we have more managed daemon tasks.

@mye956 mye956 force-pushed the ebsCSILocalState branch 4 times, most recently from 66eeb0d to 214085c Compare October 24, 2023 16:45
agent/api/task/task.go Outdated Show resolved Hide resolved
imageName := c.GetImageName()
return imageName, true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if a managed daemon in future has more than a single container we'll need to figure out which container is THE container. This might be an essential container, or else we'll need to enforce a one-daemon-container per daemon task rule. We might want to add a TODO to track this as a task-level field in future.
https://github.com/aws/amazon-ecs-agent/blob/master/agent/engine/daemonmanager/daemon_manager_linux.go#L131C3-L131C103
The daemons are currently very specifically focused on a single image as base.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

agent/ebs/watcher.go Show resolved Hide resolved
agent/ebs/watcher.go Outdated Show resolved Hide resolved
@mye956 mye956 merged commit 1391502 into aws:dev Oct 24, 2023
38 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants