Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add host resource manager methods #3700

Conversation

prateekchaudhry
Copy link
Contributor

@prateekchaudhry prateekchaudhry commented May 17, 2023

Summary

HostResourceManager keeps account of task resources for tasks running/scheduled to run on the instance Agent is running on. It will keep an account of cummulative CPU, MEMORY, PORTS_TCP, PORTS_UDP and GPU which are the same resources registered with ECS backend to schedule tasks on ECS instances. The overarching goal is to start a task on an instance as soon as the resources for it become available on the instance, and these resources will be kept track of by this manager.

A previous PR (#3684) added initialization for HostResourceManager.

This PR further adds following methods:

  • consume method to keep account of resources of tasks being scheduled to start
  • consumable method which given a set of resources returns whether those resources can be 'consumed' by that HostResourceManager
  • release method to keep account of resources of tasks being stopped and releasing resources and adding back to pool of available resources

These methods are being added here. These will be used to schedule tasks according to above description in further changes.

Related Containers Roadmap Issue

aws/containers-roadmap#325

Testing

  • Added unit tests to cover sanity checks of CPU, Memory, Ports (TCP/UDP) and GPU accounting. Checked that unit tests succeed.

New tests cover the changes: Yes

Description for the changelog

Add host resource manager methods

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

"github.com/aws/amazon-ecs-agent/agent/ecs_client/model/ecs"
"github.com/aws/amazon-ecs-agent/agent/logger"
"github.com/aws/amazon-ecs-agent/agent/utils"
)

// TODO remove this once resource, consume are used
//lint:file-ignore U1000 Ignore all unused code
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This static check directive will be removed in a future change once consume and release are being used.

Comment on lines 42 to 43
// false, err -> resources map has errors, something went wrong
// true, nil -> successfully consumed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add what's going to happen to the task in those two scenarios, similar to the first case?

// false, nil -> did not consume, should stay pending
// false, err -> resources map has errors, something went wrong
// true, nil -> successfully consumed
func (h *HostResourceManager) consume(taskArn string, resources map[string]*ecs.Resource) (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this function be exported (and also "release" below?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see these functions to be used only in engine package. Unless there are good reasons to export them, I believe we can keep them as it is

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we put host resource manager in its own package? How are consumers of host resource manager supposed to know what methods form its public API?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We put it in engine (and not in it's own package) because we only see HostResourceManager as being used to enable task scheduling, specifically in task_manager and docker_task_engine, and hence did not export the methods as well.

h.hostResourceManagerRWLock.Lock()
defer h.hostResourceManagerRWLock.Unlock()

defer logger.Debug("Consumed resources after task consume call", logger.Fields{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - can probably have a printResources method that can pretty print a resource map

})

// Check if already consumed
_, ok := h.consumedResource[taskArn]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we mean to check taskConsumed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for catching it! will correct that

agent/engine/host_resource_manager.go Show resolved Hide resolved
Comment on lines 117 to 163
cpuResource, ok := resources["CPU"]
if ok {
if *(h.initialHostResource["CPU"].IntegerValue) < *(h.consumedResource["CPU"].IntegerValue)+*(cpuResource.IntegerValue) {
return false, nil
}
} else {
return false, fmt.Errorf("No CPU in task resources")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have multiple integer type resources, consider having a common checkConsumableIntType(resourceName string, resources map[string]*ecs.Resource)

Similar for string set type.

return false, nil
}
} else {
return false, fmt.Errorf("No MEMORY in task resources")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who's going to be making sure the key is going to be present? the "converter" that takes a task and outputs the task resources?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output of ToHostResources(#3706) will be used here, and that will always all these entries

Comment on lines 179 to 186
defer logger.Debug("Consumed resources after task release call", logger.Fields{
"taskArn": taskArn,
"CPU": *h.consumedResource["CPU"].IntegerValue,
"MEMORY": *h.consumedResource["MEMORY"].IntegerValue,
"PORTS_TCP": h.consumedResource["PORTS_TCP"].StringSetValue,
"PORTS_UDP": h.consumedResource["PORTS_UDP"].StringSetValue,
"GPU": *h.consumedResource["GPU"].IntegerValue,
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's use a helper print method

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, will create the recommended printResources method

cpuResource, ok := resources["CPU"]
if ok {
if *(h.initialHostResource["CPU"].IntegerValue) < *(h.consumedResource["CPU"].IntegerValue)+*(cpuResource.IntegerValue) {
return false, nil
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a log here to indicate which resource type is the task blocked on

portsResource, ok = resources["PORTS_UDP"]
if ok {
taskPortsSlice := portsResource.StringSetValue
for _, port := range taskPortsSlice {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All consumed ports for a single task are going to be in a contiguous sub-slice in consumedResource["PORTS_UDP"], since we add them together. Therefore, instead of removing ports one at a time, I think we can find the starting and ending index and remove all at once.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking of implementing as a generic slice check, but can get this optimization in as well

agent/engine/host_resource_manager.go Outdated Show resolved Hide resolved
agent/engine/host_resource_manager.go Outdated Show resolved Hide resolved
initialHostResource map[string]*ecs.Resource
consumedResource map[string]*ecs.Resource
initialHostResource map[string]*ecs.Resource
consumedResource map[string]*ecs.Resource
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of a meta question here. IIUC the keys to these two maps are resource types such as CPU, memory, and ports. Is the full range of resource types managed by HostResourceManager type static or dynamic? If it is static, then have we thought about adding dedicated fields for each resource type and using new easy-to-use types? ecs.Resource type seems quite error prone as it is not well typed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There will be 5 resource types accounted for in HostResourceManager currently (cpu, memory, ports_tcp, ports_udp, gpu). We are choose ecs.Resource type as this is the same format agent uses to report host resources to ecs when registering instance (here).

Copy link
Contributor

@amogh09 amogh09 May 22, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. IIUC the goal of HostResourceManager is to manage the state of resources and ecs.Resource type doesn't seem to be easy-to-use for that purpose. It is probably designed for communicating the resources state to the backend.

Also since each resource has its own mutation function (CPU is a continuous variable while ports are a set), it might be worthwhile to define an interface for resource and handle the state changes in the implementations.

This is a non-blocking comment and I am just sharing my opinion. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree a separate type for HostResourceManager can be a bit more useful. On the other hand I also see that by re-using ecs.Resource we are able to piggy back on an existing implementation which could be easy in it's own way as it is declared to define ecs resources (albeit to communicate to backend, perhaps). And using a map, accessing values becomes fairly easy.

If it is not much of a hassle, I will stay with ecs.Resource implementation, as previous, parallel (and future) work is already using this type. But I will be willing to change this upon recommendation :)

@prateekchaudhry prateekchaudhry force-pushed the feature/task-resource-accounting branch 3 times, most recently from d2831ee to 0664d5f Compare May 21, 2023 09:36
return true, nil
}

ok, err := h.consumable(resources)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does it typically mean when we have 1 resource type errored out of the 5 resource types, and the other ones were fine? What will be our action when this happens?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should never happen, but putting in as a check. The task should fail to start with incorrect resource conversion as put in the comment above consume // false, err -> resources map has errors, task should fail as cannot schedule with 'wrong' resource map (this basically never happens)

agent/engine/host_resource_manager.go Outdated Show resolved Hide resolved
// false, nil -> did not consume, should stay pending
// false, err -> resources map has errors, something went wrong
// true, nil -> successfully consumed
func (h *HostResourceManager) consume(taskArn string, resources map[string]*ecs.Resource) (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we put host resource manager in its own package? How are consumers of host resource manager supposed to know what methods form its public API?

@prateekchaudhry prateekchaudhry force-pushed the feature/task-resource-accounting branch from 58844e5 to b9eb126 Compare May 23, 2023 23:09
amogh09
amogh09 previously approved these changes May 23, 2023
amogh09
amogh09 previously approved these changes May 24, 2023
}

// Checks all resources exists and their values are not nil
func checkResourcesHealth(resources map[string]*ecs.Resource) error {
Copy link
Contributor

@yinyic yinyic May 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if we necessarily want to check that all resource types exist in the map. Technically, for tasks that don't consume certain resource types, it should be okay for them to leave it out.

We might want to do check, instead:

for each key,val in resources:
  does key exist in host resources? i.e. is it a recognized resource type? if not, throw error
  if valid key, then check if value type matches
  Lastly, check if the specific type of value is not nil

Comment on lines 280 to 284
for ; end >= 0; end-- {
if *s1[end] == *s2[len(s2)-1] {
break
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have end = begin + len(s2)?

Realmonia
Realmonia previously approved these changes May 24, 2023
@prateekchaudhry prateekchaudhry dismissed stale reviews from Realmonia and amogh09 via f65e6f5 May 24, 2023 18:50
@prateekchaudhry prateekchaudhry force-pushed the feature/task-resource-accounting branch from f65e6f5 to 44f0d4d Compare May 24, 2023 19:40
@prateekchaudhry prateekchaudhry merged commit 3348d1d into aws:feature/task-resource-accounting May 25, 2023
@prateekchaudhry prateekchaudhry mentioned this pull request Jun 22, 2023
sparrc added a commit to sparrc/amazon-ecs-agent that referenced this pull request Jun 30, 2023
sparrc added a commit that referenced this pull request Jul 5, 2023
prateekchaudhry added a commit to prateekchaudhry/amazon-ecs-agent that referenced this pull request Jul 12, 2023
prateekchaudhry added a commit that referenced this pull request Jul 12, 2023
* Revert "Revert "host resource manager initialization""

This reverts commit dafb967.

* Revert "Revert "Add method to get host resources reserved for a task (#3706)""

This reverts commit 8d824db.

* Revert "Revert "Add host resource manager methods (#3700)""

This reverts commit bec1303.

* Revert "Revert "Remove task serialization and use host resource manager for task resources (#3723)""

This reverts commit cb54139.

* Revert "Revert "add integ tests for task accounting (#3741)""

This reverts commit 61ad010.

* Revert "Revert "Change reconcile/container update order on init and waitForHostResources/emitCurrentStatus order (#3747)""

This reverts commit 60a3f42.

* Revert "Revert "Dont consume host resources for tasks getting STOPPED while waiting in waitingTasksQueue (#3750)""

This reverts commit 8943792.
Realmonia pushed a commit that referenced this pull request Jul 20, 2023
* Revert reverted changes for task resource accounting (#3796)

* Revert "Revert "host resource manager initialization""

This reverts commit dafb967.

* Revert "Revert "Add method to get host resources reserved for a task (#3706)""

This reverts commit 8d824db.

* Revert "Revert "Add host resource manager methods (#3700)""

This reverts commit bec1303.

* Revert "Revert "Remove task serialization and use host resource manager for task resources (#3723)""

This reverts commit cb54139.

* Revert "Revert "add integ tests for task accounting (#3741)""

This reverts commit 61ad010.

* Revert "Revert "Change reconcile/container update order on init and waitForHostResources/emitCurrentStatus order (#3747)""

This reverts commit 60a3f42.

* Revert "Revert "Dont consume host resources for tasks getting STOPPED while waiting in waitingTasksQueue (#3750)""

This reverts commit 8943792.

* fix memory resource accounting for multiple containers in single task (#3782)

* fix memory resource accounting for multiple containers

* change unit tests for multiple containers, add unit test for awsvpc
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants