Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dont consume host resources for tasks getting STOPPED while waiting in waitingTasksQueue #3750

Conversation

prateekchaudhry
Copy link
Contributor

@prateekchaudhry prateekchaudhry commented Jun 15, 2023

Summary

Context: Task Resource Accounting feature implements queuing (monitorQueuedTasks) for tasks to go through where each task waits to 'consume' resources in host_resource_manager. host_resource_manager keeps an account of each task which has acquired resources to progress for task creation/running.

If a ACS StopTask request arrives at Agent while the task is still in this queue, overseeTask - waiting in waitForHostResources, falls through and does not wait for the event from monitorQueuedTasks, ends up calling 'host_resource_manager.release()` during emitTaskEvent.

But if these STOPPED tasks are dequeued later (in docker_task_engine), they just write to the channel with no listeners after the host_resource_manager.consume() call in docker_task_engine, and the resources persist in host_resource_manager.

This PR fixes this by not calling host_resource_manager.consume() for tasks in monitorQueuedTasks whose desired status has changed to STOPPED. Some cautionary implementation steps have been taken to isolate working of monitorQueuedTasks as described in implementation

Implementation

  • Added a monitorQueuedTasksLock for the main body of the loop in monitorQueuedTasks. This is to synchronize the processing of the topTask in monitorQueuedTasks and any parallel update to its desired status - say from ACS updates
  • The main body of this processing in monitorQueuedTasks has been moved to a method tryDequeueWaitingTasks - to put into this critical section of monitorQueuedTasksLock
  • monitorQueuedTasksLock is also used in handleDesiredStatusChange which updates the desired status of a managed task
  • Adds an integ test TestHostResourceManagerStopTaskNotBlockWaitingTasks to test this behavior
  • Made container names unique in existing test TestHostResourceManagerResourceUtilization for easier debugging

As a result of this implementation, consider the scenarios

  1. StopTask arrives and is processed when the task is not yet processed by monitorQueuedTasks
  • This is safe because when task is dequeued, it's desired status is read as STOPPED and queue returns on consumedHostResourceEvent without consuming resources
  1. StopTask arrives while the task is being processed by monitorQueuedTasks
  • Due to critical sections put in place, desired status will be updated after monitorQueuedTasks is done processing. Resources will be consumed, and released in emitTaskEvent
  1. StopTask arrives and is processed any time after monitorQueuedTasks is done processing
  • Desired status will be updated after monitorQueuedTasks is done processing. Resources will be consumed and released in emitTaskEvent

Related Containers Roadmap Issue

aws/containers-roadmap#325

Testing

Verified by

  • Starting and stopping a task with stopTimeout of 300s. Starting another same task and stopped it. This task goes to the waiting queue and also returns in overseeTask goroutine. Later this task does not call consume and checked debug logs
level=info time=2023-06-15T07:22:18Z msg="Task desired status STOPPED while waiting for host resources, not consuming" taskARN="arn:aws:ecs:us-west-2:<account_id>task/taskAccounting/aa77bc7083d9401b8f4aac32424abed2"
  • Added an integ test TestHostResourceManagerStopTaskNotBlockWaitingTasks and verified it succeeds. This test simulates ACS stopTask for a tasks stuck in waitingTasksQueue and verifies the resources are released in host_resource_manager

New tests cover the changes:
Yes

Description for the changelog

Dont consume host resources for tasks getting STOPPED while waiting

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@prateekchaudhry prateekchaudhry requested a review from a team as a code owner June 15, 2023 07:36
@prateekchaudhry prateekchaudhry force-pushed the taskAccountingAcsStopTask branch from b1c74ce to 0f78263 Compare June 15, 2023 17:16
@prateekchaudhry prateekchaudhry force-pushed the taskAccountingAcsStopTask branch 2 times, most recently from 376dc60 to c3b121a Compare June 15, 2023 21:14
Yiyuanzzz
Yiyuanzzz previously approved these changes Jun 15, 2023
@prateekchaudhry prateekchaudhry changed the title Dont consume host resources for tasks getting STOPPED while waiting Dont consume host resources for tasks getting STOPPED while waiting in waitingTasksQueue Jun 16, 2023
@@ -1227,7 +1227,7 @@ func TestHostResourceManagerResourceUtilization(t *testing.T) {
testTask := createTestTask(taskArn)

// create container
A := createTestContainerWithImageAndName(baseImageForOS, "A")
A := createTestContainerWithImageAndName(baseImageForOS, fmt.Sprintf("A-%d", i))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an unrelated change in another test case related to task resource accounting. Can make debugging easier in future if each task and container are named uniquely.

agent/engine/docker_task_engine.go Show resolved Hide resolved
agent/engine/task_manager.go Show resolved Hide resolved
agent/engine/task_manager.go Outdated Show resolved Hide resolved
@Realmonia
Copy link
Contributor

Also a general question with agent restart (sorry if this is me missing some context) - how are we storing tasks in queue to statefile (boltdb) when agent restart or recover from unexpected exit?

@prateekchaudhry
Copy link
Contributor Author

Also a general question with agent restart (sorry if this is me missing some context) - how are we storing tasks in queue to statefile (boltdb) when agent restart or recover from unexpected exit?

We do not store tasks in queue to statefile. On restarts, reconcileHostResources method in docker_task_engine.go takes the tasks which have progressed beyond waitForHostResources state and restores the host_resource_manager state.

So a task which has not progressed beyond that state will queue up again, as on restart, each task's startTask is called again. Those tasks which are not pre-allocated during reconcileHostResources will thus end up queuing up again.

@Realmonia
Copy link
Contributor

We do not store tasks in queue to statefile. On restarts, reconcileHostResources method in docker_task_engine.go takes the tasks which have progressed beyond waitForHostResources state and restores the host_resource_manager state.

So a task which has not progressed beyond that state will queue up again, as on restart, each task's startTask is called again. Those tasks which are not pre-allocated during reconcileHostResources will thus end up queuing up again.

Oh I understand now. So those tasks are kept in state like TaskCreated. SGTM.

@prateekchaudhry prateekchaudhry merged commit 6709bfe into aws:feature/task-resource-accounting Jun 16, 2023
prateekchaudhry added a commit that referenced this pull request Jun 21, 2023
…n waitingTasksQueue (#3750)

* dont consume resources for acs stopped tasks

* add integ test for the stopTask in waitingTaskQueue case

* remove discardConsumedHostResourceEvents
prateekchaudhry added a commit that referenced this pull request Jun 22, 2023
…n waitingTasksQueue (#3750)

* dont consume resources for acs stopped tasks

* add integ test for the stopTask in waitingTaskQueue case

* remove discardConsumedHostResourceEvents
@prateekchaudhry prateekchaudhry mentioned this pull request Jun 22, 2023
sparrc added a commit to sparrc/amazon-ecs-agent that referenced this pull request Jun 30, 2023
sparrc added a commit that referenced this pull request Jul 5, 2023
…aiting in waitingTasksQueue (#3750)"

This reverts commit 8b96a11.
prateekchaudhry added a commit that referenced this pull request Jul 12, 2023
…n waitingTasksQueue (#3750)

* dont consume resources for acs stopped tasks

* add integ test for the stopTask in waitingTaskQueue case

* remove discardConsumedHostResourceEvents
prateekchaudhry added a commit to prateekchaudhry/amazon-ecs-agent that referenced this pull request Jul 12, 2023
… while waiting in waitingTasksQueue (aws#3750)""

This reverts commit 8943792.
prateekchaudhry added a commit that referenced this pull request Jul 12, 2023
* Revert "Revert "host resource manager initialization""

This reverts commit dafb967.

* Revert "Revert "Add method to get host resources reserved for a task (#3706)""

This reverts commit 8d824db.

* Revert "Revert "Add host resource manager methods (#3700)""

This reverts commit bec1303.

* Revert "Revert "Remove task serialization and use host resource manager for task resources (#3723)""

This reverts commit cb54139.

* Revert "Revert "add integ tests for task accounting (#3741)""

This reverts commit 61ad010.

* Revert "Revert "Change reconcile/container update order on init and waitForHostResources/emitCurrentStatus order (#3747)""

This reverts commit 60a3f42.

* Revert "Revert "Dont consume host resources for tasks getting STOPPED while waiting in waitingTasksQueue (#3750)""

This reverts commit 8943792.
Realmonia pushed a commit that referenced this pull request Jul 20, 2023
* Revert reverted changes for task resource accounting (#3796)

* Revert "Revert "host resource manager initialization""

This reverts commit dafb967.

* Revert "Revert "Add method to get host resources reserved for a task (#3706)""

This reverts commit 8d824db.

* Revert "Revert "Add host resource manager methods (#3700)""

This reverts commit bec1303.

* Revert "Revert "Remove task serialization and use host resource manager for task resources (#3723)""

This reverts commit cb54139.

* Revert "Revert "add integ tests for task accounting (#3741)""

This reverts commit 61ad010.

* Revert "Revert "Change reconcile/container update order on init and waitForHostResources/emitCurrentStatus order (#3747)""

This reverts commit 60a3f42.

* Revert "Revert "Dont consume host resources for tasks getting STOPPED while waiting in waitingTasksQueue (#3750)""

This reverts commit 8943792.

* fix memory resource accounting for multiple containers in single task (#3782)

* fix memory resource accounting for multiple containers

* change unit tests for multiple containers, add unit test for awsvpc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants