-
Notifications
You must be signed in to change notification settings - Fork 619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix loading CSI driver container from state if it exists #3970
Conversation
7a2b7b1
to
979ff2e
Compare
agent/api/task/task.go
Outdated
|
||
for _, c := range task.Containers { | ||
containerStatus := c.GetKnownStatus() | ||
if containerStatus.IsRunning() && c.IsInternal() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why does the container need to be running for us to check if it's name matches the "managed daemon" names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couldn't this be a race condition where the agent gets a new EBS task before the existing managed daemon task containers have gone to running?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question, we don't want to run into a scenario where the task/container was stopped but we still add it to the list of manged daemon tasks. otherwise agent won't be able to create/start a new CSI driver task prior to staging EBS volumes since it'll think that there's already an existing active CSI driver task
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couldn't this be a race condition where the agent gets a new EBS task before the existing managed daemon task containers have gone to running?
we should be taking care of this when handling an EBS attachment payload -> https://github.com/aws/amazon-ecs-agent/blob/dev/agent/ebs/watcher.go#L132
we first check that if there's already an existing CSI driver task then we shouldn't create another.
Thinking about it now, we should also include if there's a CSI driver task in the process of being created already when loading from the BoltDB after agent restarts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One other thought about removing the IsRunning check: Managed Daemons have a container restart policy of on failure
which implies that they may be running or stopped. We should only ever have one container per daemon so we can track its health regardless of its process status. We'll depend on the container restart policy to handle the daemon lifecycle and shouldn't launch redundant daemons.
Hmm, I don't quite follow here. If we were to introduce a new managed daemon then it could be added to the set of managed daemon images and also check for that as well along with the CSI driver image. You're completely right that the goal for this PR is to check if there's already an CSI driver task. Was also thinking of long term as well when we have more managed daemon tasks. |
66eeb0d
to
214085c
Compare
imageName := c.GetImageName() | ||
return imageName, true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if a managed daemon in future has more than a single container we'll need to figure out which container is THE container. This might be an essential container, or else we'll need to enforce a one-daemon-container per daemon task rule. We might want to add a TODO to track this as a task-level field in future.
https://github.com/aws/amazon-ecs-agent/blob/master/agent/engine/daemonmanager/daemon_manager_linux.go#L131C3-L131C103
The daemons are currently very specifically focused on a single image as base.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
214085c
to
44082d5
Compare
Summary
This PR will address the issue where agent won't be able to recognize an existing running CSI driver daemon task after an agent restart. Because of this, when we get a new EBS attachment payload, agent will start up a new CSI driver daemon task even if there's already an existing running CSI driver task. While all of the managed tasks are properly populated back into the task engine during the setup of agent start, the list of managed daemon task aren't. To fix this, we'll check if there's an existing non-stopped CSI driver daemon task right after we load all of the managed tasks from BoltDB on agent restart.
Implementation details
Implemented the following:
agent/api/task/task.go
: New functionality calledIsManagedDaemonTask
where it checks if the task is a managed daemon if it satisfy the following:ContainerManagedDaemon
loadTask
agent/api/container/container.go
:IsManagedDaemonContainer
where it checks if a container is of typeContainerManagedDaemon
. This will be called inIsManagedDaemonTask
as part of the managed daemon task check.GetImageName
where it gets the image name without the tags of a containeragent/api/container/containertype.go
: AddedContainerManagedDaemon
to the map of container types that's used for encoding and decodingContainerType
objectsagent/engine/data.go
: ModifiedloadTask
to check if any of the tasks are considered a managed daemon task. If a task is then we add it to the map of managed daemon tasks in our task engine.agent/ebs/watcher.go
: Added an extra condition that we should also be creating a new CSI driver daemon task if the existing CSI driver daemon task has stopped (i.e. terminal).(Note: There should only be at most one of each managed daemon task running on one instance at a time; i.e. multiple managed daemon tasks can be running on one instance as long as they're not the same one)
Testing
Added new unit tests.
Manual testing:
Ran an EBS-backed task which then created the CSI driver daemon task. After a bit, I restarted agent with the CSI driver task still running and was able to confirm that agent was able to properly load back the existing CSI driver daemon task.
Ran another EBS-backed task and agent didn't create a new CSI driver daemon task as well as reused the existing task.
New tests cover the changes:Yes
Description for the changelog
Fix local agent state for CSI driver daemon task
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.