Remove task serialization and use host resource manager for task resources #3723

Contributor

@prateekchaudhry prateekchaudhry commented May 30, 2023

Summary

ECS Agent ensures that host resources are available on the instance before running a task. Currently this is implemented through serialization: when scheduling a task, the agent waits for all previously stopping tasks to stop, i.e. those whose StopSequenceNumber in the payload from ACS is less than the seqnum of the payload of the requested task.

This PR removes this serialization behavior. Instead, it uses HostResourceManager to account for task resources and a FIFO task queue built into docker_task_engine to hold waiting tasks. The agent will hence use the CPU, memory, ports (TCP/UDP) and number of GPUs available to manage tasks, and start progressing a task as soon as resources for it become available, instead of waiting for all stopping tasks to stop.

Implementation details

  • Removes the sequential_waitgroup package and references to StartSequenceNumber and StopSequenceNumber, which are constructs related to task serialization

  • Tasks get queued in a waitingTaskQueue and wait for host resources (managed through HostResourceManager) to become available. A goroutine, monitorQueuedTasks, dequeues and wakes up the waiting tasks as resources become available. When it cannot dequeue any more tasks because resources are not available, it waits

  • When a task stops or a new task arrives, monitorQueuedTasks is woken up in case it is blocked

  • Management of host resources: when monitorQueuedTasks accounts for a task's resources, the resources are consumed. When a task changes its knownStatus to STOPPED and emits a change of state, the resources are released

  • For Agent restarts, reconcileHostResources runs during synchronizeState and synchronizes the HostResourceManager data structures with the known task states. If any container is known to have progressed beyond the ContainerStatusNone state, host resources are consumed.
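
As a rough illustration of the queueing behavior described in the list above, the following Go sketch drains a FIFO queue against a pool of CPU and memory. The `task`, `host` and `drain` names are hypothetical and are not taken from the agent's code. The sketch ignores ports and GPUs, and it assumes that a task at the head of the queue that does not fit blocks the tasks behind it.

```go
package main

import "fmt"

// task is a hypothetical stand-in for the agent's task type; only the
// fields needed for this sketch are modelled.
type task struct {
	arn    string
	cpu    int
	memory int
}

// host tracks remaining capacity. It is a simplified assumption, not the
// real HostResourceManager.
type host struct {
	cpu    int
	memory int
}

// fits reports whether t can start with the remaining capacity.
func (h *host) fits(t task) bool {
	return t.cpu <= h.cpu && t.memory <= h.memory
}

// consume reserves t's resources.
func (h *host) consume(t task) {
	h.cpu -= t.cpu
	h.memory -= t.memory
}

// release returns t's resources to the pool, e.g. when t stops.
func (h *host) release(t task) {
	h.cpu += t.cpu
	h.memory += t.memory
}

// drain starts tasks from the head of the queue for as long as the head
// fits. It stops at the first task that does not fit, which preserves FIFO
// order, and returns the started tasks plus the still-waiting remainder.
func drain(h *host, queue []task) (started, waiting []task) {
	for len(queue) > 0 && h.fits(queue[0]) {
		h.consume(queue[0])
		started = append(started, queue[0])
		queue = queue[1:]
	}
	return started, queue
}

func main() {
	h := &host{cpu: 2048, memory: 2048}
	queue := []task{
		{arn: "task-a", cpu: 1024, memory: 1024},
		{arn: "task-b", cpu: 2048, memory: 512},
		{arn: "task-c", cpu: 256, memory: 256},
	}

	// Only task-a starts: task-b does not fit and blocks task-c behind it.
	started, queue := drain(h, queue)
	fmt.Println("started:", len(started), "waiting:", len(queue))

	// Once task-a stops and releases its resources, task-b can start.
	h.release(started[0])
	started, queue = drain(h, queue)
	fmt.Println("started:", len(started), "waiting:", len(queue))
}
```

In the PR, the equivalent of `drain` runs inside the monitorQueuedTasks goroutine and is re-run whenever a task stops or a new task is enqueued.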

Related PRs

Related Containers Roadmap Issue

aws/containers-roadmap#325

Testing

Manually tested reconciliation behavior with agent restarts and verified resources are allocated correctly from agent logs

level=debug time=2023-06-01T00:55:50Z msg="Task host resources to account for" MEMORY=1024 PORTS_TCP=[] PORTS_UDP=[] GPU=0 taskArn="arn:aws:ecs:us-west-2:<>:task/taskAccounting/..." CPU=1024

New tests cover the changes: Yes
TestTaskWaitForHostResources unit test to test task queueing/dequeuing

Description for the changelog

Remove task serialization and use host resource manager for task resources

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@prateekchaudhry prateekchaudhry changed the title [WIP] Remove task serialization and use host resource manager for scheduling Remove task serialization and use host resource manager for scheduling May 30, 2023
@prateekchaudhry prateekchaudhry changed the title Remove task serialization and use host resource manager for scheduling Remove task serialization and use host resource manager for task resources May 30, 2023
Yiyuanzzz
Yiyuanzzz previously approved these changes May 30, 2023
// Call to release here for stopped tasks should always succeed
// Idempotent release call
if taskStatus.Terminal() {
err := engine.hostResourceManager.release(task.Arn, resources)
Member

@fierlion fierlion May 30, 2023

When will this be useful? Will it only ever be a no-op if the agent is always starting from zero?
It would help if you could offer an example of where this might be useful in the future.

Contributor Author

Not particularly useful right now, but this is a generalized implementation for keeping HostResourceManager in sync with the engine. So it might find more uses in the future, such as periodically syncing the engine and the resource manager.
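
To illustrate what an idempotent consume/release pair could look like, here is a minimal Go sketch keyed by task ARN. The `manager` and `resources` types and their fields are assumptions made for this example and are not the agent's HostResourceManager API. UDP ports are omitted for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// resources is a simplified, hypothetical view of what one task reserves.
type resources struct {
	cpu      int
	memory   int
	gpus     int
	tcpPorts []uint16
}

// manager is an illustrative stand-in for a host resource manager.
type manager struct {
	mu        sync.Mutex
	cpu       int             // remaining CPU units
	memory    int             // remaining memory
	gpus      int             // remaining GPUs
	usedPorts map[uint16]bool // TCP ports currently reserved
	consumed  map[string]bool // task ARNs already accounted for
}

func newManager(cpu, memory, gpus int) *manager {
	return &manager{
		cpu: cpu, memory: memory, gpus: gpus,
		usedPorts: map[uint16]bool{},
		consumed:  map[string]bool{},
	}
}

// consume reserves r for taskArn. It returns true if the resources are
// reserved after the call, including when they were already reserved by an
// earlier call, and false if they do not fit.
func (m *manager) consume(taskArn string, r resources) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.consumed[taskArn] {
		return true // already accounted for: repeat call is a no-op
	}
	if r.cpu > m.cpu || r.memory > m.memory || r.gpus > m.gpus {
		return false
	}
	for _, p := range r.tcpPorts {
		if m.usedPorts[p] {
			return false
		}
	}
	m.cpu -= r.cpu
	m.memory -= r.memory
	m.gpus -= r.gpus
	for _, p := range r.tcpPorts {
		m.usedPorts[p] = true
	}
	m.consumed[taskArn] = true
	return true
}

// release returns r to the pool. Releasing a task that holds nothing is a
// no-op, so the call is safe to repeat.
func (m *manager) release(taskArn string, r resources) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if !m.consumed[taskArn] {
		return
	}
	m.cpu += r.cpu
	m.memory += r.memory
	m.gpus += r.gpus
	for _, p := range r.tcpPorts {
		delete(m.usedPorts, p)
	}
	delete(m.consumed, taskArn)
}

func main() {
	m := newManager(2048, 4096, 0)
	r := resources{cpu: 1024, memory: 1024, tcpPorts: []uint16{8080}}

	ok := m.consume("task-a", r)
	fmt.Println(ok, m.cpu)
	ok = m.consume("task-a", r) // repeated call does not double-count
	fmt.Println(ok, m.cpu)

	m.release("task-a", r)
	m.release("task-a", r) // repeated call does not over-release
	fmt.Println(m.cpu)
}
```

Keying the bookkeeping on the task ARN is what makes a reconcile pass safe to run more than once over the same set of tasks.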

// Consume host resources if task has progressed
// Call to consume here should always succeed
// Idempotent consume call
if !task.IsInternal && taskStatus > apitaskstatus.TaskCreated {
Contributor

We want to check for taskStatus == apitaskstatus.TaskCreated || taskStatus == apitaskstatus.TaskRunning

Comment on lines 332 to 338
waitingTaskQueueSingleLen := false
engine.waitingTasksLock.Lock()
waitingTaskQueueSingleLen = len(engine.waitingTaskQueue) == 1
engine.waitingTasksLock.Unlock()
if waitingTaskQueueSingleLen {
engine.monitorQueuedTaskEvent <- struct{}{}
}
Contributor

Not quite following this logic - I think it's sufficient to wake up the queue when we enqueue and when a task stops

Contributor Author

Ack - making the channel buffered and adding a no-op/empty default case
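
A buffered channel with an empty default case is a common Go pattern for a non-blocking wake-up. The sketch below shows the pattern in isolation; the `wake` helper and the `events` channel are hypothetical names, not the code that was merged.

```go
package main

import "fmt"

// wake sends a wake-up signal without ever blocking the caller. If a
// signal is already pending in the buffer, the new one is dropped.
func wake(ch chan struct{}) {
	select {
	case ch <- struct{}{}:
		// Signal queued; the receiver will see it on its next receive.
	default:
		// A signal is already pending, so there is nothing to do.
	}
}

func main() {
	events := make(chan struct{}, 1) // buffer of one pending signal

	// Several senders fire before the receiver runs; none of them block.
	wake(events)
	wake(events)
	wake(events)
	fmt.Println("pending signals:", len(events)) // prints "pending signals: 1"

	// The receiver consumes the single pending signal.
	<-events
	fmt.Println("pending signals:", len(events)) // prints "pending signals: 0"
}
```

Dropping extra signals is harmless as long as the receiver re-examines the whole queue on each wake-up, which is the assumption this sketch relies on.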

break
}
}
logger.Debug("No more tasks in Waiting Task Queue, waiting for new tasks")
Contributor

Nit - no more tasks could be started at this moment

Comment on lines 394 to 395
consumable, err := engine.hostResourceManager.consumableSafe(taskHostResources)
if err != nil {
Contributor

nit- redundant check

fierlion
fierlion previously approved these changes May 31, 2023
Member

@fierlion fierlion left a comment

I'd like to see another test or three covering the engine as an opaque box. You can add this as a follow up.

resourcesToRelease := task.ToHostResources()
err := engine.hostResourceManager.release(task.Arn, resourcesToRelease)
if err != nil {
logger.Critical("Failed to release resources after tast stopped", logger.Fields{field.TaskARN: task.Arn})
Contributor

non-blocking: small typo here "tast"

Comment on lines +326 to +327
// Always wakes up when at least one event arrives on buffered channel monitorQueuedTaskEvent
// but does not block if monitorQueuedTasks is already processing queued tasks
Contributor

Can we elaborate a little more in the comment on 1) when we will be invoking this method (who will be sending messages onto the channel), and 2) why a buffer size of one is sufficient (why it's okay to drop any additional messages)?

Comment on lines +597 to +598
// Before starting managedTask goroutines, pre-allocate resources for already running
// tasks in host resource manager
Contributor

Not exactly "already running", more like tasks that have progressed beyond the resource consumption check

Contributor Author

Yes, will update these in follow up PR

Contributor

@yinyic yinyic left a comment

Both comments are on comments, gonna approve to unblock. Please update the comments in the next PR (I'm assuming we'll have a follow-up with more/updated tests)

@prateekchaudhry prateekchaudhry merged commit e00484f into aws:feature/task-resource-accounting Jun 1, 2023
@prateekchaudhry prateekchaudhry mentioned this pull request Jun 22, 2023
sparrc added a commit to sparrc/amazon-ecs-agent that referenced this pull request Jun 30, 2023
sparrc added a commit that referenced this pull request Jul 5, 2023
prateekchaudhry added a commit to prateekchaudhry/amazon-ecs-agent that referenced this pull request Jul 12, 2023
prateekchaudhry added a commit that referenced this pull request Jul 12, 2023
* Revert "Revert "host resource manager initialization""

This reverts commit dafb967.

* Revert "Revert "Add method to get host resources reserved for a task (#3706)""

This reverts commit 8d824db.

* Revert "Revert "Add host resource manager methods (#3700)""

This reverts commit bec1303.

* Revert "Revert "Remove task serialization and use host resource manager for task resources (#3723)""

This reverts commit cb54139.

* Revert "Revert "add integ tests for task accounting (#3741)""

This reverts commit 61ad010.

* Revert "Revert "Change reconcile/container update order on init and waitForHostResources/emitCurrentStatus order (#3747)""

This reverts commit 60a3f42.

* Revert "Revert "Dont consume host resources for tasks getting STOPPED while waiting in waitingTasksQueue (#3750)""

This reverts commit 8943792.
Realmonia pushed a commit that referenced this pull request Jul 20, 2023
* Revert reverted changes for task resource accounting (#3796)

* fix memory resource accounting for multiple containers in single task (#3782)

* fix memory resource accounting for multiple containers

* change unit tests for multiple containers, add unit test for awsvpc