Change reconcile/container update order on init and waitForHostResources/emitCurrentStatus order #3747
// Wait here until enough resources are available on host for the task to progress
// - Waits until host resource manager successfully 'consume's task resources and returns
// - For tasks which have crossed this stage before (on agent restarts), resources are pre-consumed - returns immediately
// (resources are later 'release'd on Stopped task emitTaskEvent call)
mtask.waitForHostResources()
Maybe we can add a small optimization here to invoke waitForHostResources only if the task's known status is not STOPPED? If the task is already STOPPED, we know the host resources will be released immediately anyway.
Yes, let me add that. Although for this particular 'restart case', resources would be detected as 'STOPPED', and the waitForHostResources behavior is "resources are pre-consumed - returns immediately", i.e. without queueing.
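The optimization discussed in this thread can be illustrated with a toy model. This is only a sketch under stated assumptions, not the agent's code: toyManager, consume, release, and progressTask are hypothetical names, and the model tracks a single CPU number per task instead of the real host resource manager state. It shows a consume call that returns immediately for pre-consumed tasks, and a guard that skips waiting for resources when the task is already stopped.

```go
package main

import "fmt"

// toyManager is a hypothetical stand-in for the host resource manager.
// It only tracks CPU units consumed per task ARN.
type toyManager struct {
	consumed map[string]int
}

// consume reserves cpu for the task. If the task already holds resources
// (for example, pre-consumed during reconciliation after a restart), it
// returns immediately without reserving again.
func (m *toyManager) consume(taskArn string, cpu int) {
	if _, ok := m.consumed[taskArn]; ok {
		return
	}
	m.consumed[taskArn] = cpu
}

// release frees whatever the task holds; releasing twice is harmless.
func (m *toyManager) release(taskArn string) {
	delete(m.consumed, taskArn)
}

// progressTask mirrors the guarded flow from the review thread: consume
// resources only when the task is not already stopped, then emit status,
// which releases resources for a stopped task.
func progressTask(m *toyManager, taskArn string, cpu int, stopped bool) {
	if !stopped {
		m.consume(taskArn, cpu)
	}
	if stopped {
		m.release(taskArn)
	}
}

func main() {
	m := &toyManager{consumed: map[string]int{}}

	// Restart case: resources were pre-consumed, then the task is found stopped.
	m.consume("task-a", 512)
	progressTask(m, "task-a", 512, true)

	// A task that is still running keeps its reservation.
	progressTask(m, "task-b", 256, false)

	fmt.Println(len(m.consumed), m.consumed["task-b"]) // prints: 1 256
}
```

In this model the stopped task ends with nothing reserved whether or not the guard is present, because consume is a no-op for pre-consumed tasks. That matches the reply above: the guard is a small optimization, not a behavior change, for the restart case.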
* Revert reverted changes for task resource accounting (#3796)
* Revert "Revert "host resource manager initialization"" This reverts commit dafb967.
* Revert "Revert "Add method to get host resources reserved for a task (#3706)"" This reverts commit 8d824db.
* Revert "Revert "Add host resource manager methods (#3700)"" This reverts commit bec1303.
* Revert "Revert "Remove task serialization and use host resource manager for task resources (#3723)"" This reverts commit cb54139.
* Revert "Revert "add integ tests for task accounting (#3741)"" This reverts commit 61ad010.
* Revert "Revert "Change reconcile/container update order on init and waitForHostResources/emitCurrentStatus order (#3747)"" This reverts commit 60a3f42.
* Revert "Revert "Dont consume host resources for tasks getting STOPPED while waiting in waitingTasksQueue (#3750)"" This reverts commit 8943792.
* fix memory resource accounting for multiple containers in single task (#3782)
* fix memory resource accounting for multiple containers
* change unit tests for multiple containers, add unit test for awsvpc
Summary
This PR fixes a bug found in task resource accounting on Agent restarts. The issue occurs when the Agent is down, a container stops (repro with docker stop <container_id>), and the Agent comes back up: task resources are not released by the host resource manager.

Problems in the current order of calls are:

* reconcileHostResources (see Remove task serialization and use host resource manager for task resources #3723) is being called after stopped containers are accounted for in the task engine. So if a container stops while the agent is stopped, reconcileHostResources may not pre-'consume' resources for a task, because its overseeTask would still need to be run for cleanup if its status has changed to stopped. As a result, reconcileHostResources should be called before filterTasksToStartUnsafe, which updates the container and task statuses. This order is updated with comments.
* emitCurrentStatus also does an emitTaskEvent call, which releases 'consumed' resources in the host resource manager (see "Management of host resources" in the summary of Remove task serialization and use host resource manager for task resources #3723). waitForHostResources() then re-allocates the resources, so they remain accounted for in the host resource manager indefinitely in these cases (when tasks stop while the agent is down). This PR moves emitCurrentStatus after the waitForHostResources() call and updates the comments.

This PR also makes a change in host_resource_manager and ToHostResources in task.go to dereference ports from []*string to []string during logging, for more understandable debug logging. This change results in outputting the actual port values instead of the pointer addresses of the ports, which might not always be relevant. See PORTS_TCP and PORTS_UDP after the change here:

[Debug] logger=structured msg="Consumed resources after task consume call" CPU=512 MEMORY=768 PORTS_TCP=[22 23] PORTS_UDP=[1000 1001] GPU= 1 taskArn="arn:aws:ecs:us-east-1:<aws_account_id>:task/cluster-name/11111"
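The port logging change described above can be sketched in a few lines. This is a minimal illustration, not the code from host_resource_manager or task.go: derefPorts is a hypothetical helper name, and skipping nil entries is an assumption made here for safety.

```go
package main

import "fmt"

// derefPorts copies the values out of a slice of string pointers so that
// logging prints the ports themselves rather than pointer addresses.
// Nil entries are skipped. (Hypothetical helper, for illustration only.)
func derefPorts(ports []*string) []string {
	out := make([]string, 0, len(ports))
	for _, p := range ports {
		if p != nil {
			out = append(out, *p)
		}
	}
	return out
}

func main() {
	a, b := "22", "23"
	ports := []*string{&a, &b}

	// Formatting a []*string prints one pointer address per element,
	// and the addresses vary from run to run.
	fmt.Println(ports)

	// After dereferencing, the log shows the port values.
	fmt.Println(derefPorts(ports)) // prints: [22 23]
}
```

Dereferencing only at logging time keeps the stored representation unchanged while making the debug output readable.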
Related Containers Roadmap Issue
aws/containers-roadmap#325
Testing
With debug logs, verified that agent restarts function properly and that resources are accounted for and released properly for the following scenarios:
Description for the changelog
Fix Agent restarts and ports logging with task resource accounting
Licensing
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.