Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Host Resource Manager initialization #3684

Conversation

Yiyuanzzz
Copy link
Contributor

@Yiyuanzzz Yiyuanzzz commented May 9, 2023

Summary

Initialize host resource manager with available host resource values. Host resource manager will keep track of the available host resources as tasks start and stop. Resources include host port, CPU, memory, and GPU. New methods and tests will be followed up in later implementations.

Implementation details

  • Added a new host_resource_manager.go file in engine for host resource manager initialization.
  • Fixed existing constructors and tests with new host resource manager being added.
  • Ran make gogenerate to generate mocks in api_mocks.go
  • Keep the existing getResources private and added new GetHostResources method to interface.
  • Use exitcodes.ExitError exit code when agent unable to fetch host resources, agent will retry if agent fail to get host resource, if agent keep failing, agent will not be able to register container instance, customer will have some idle EC2 instances if they don’t have a proper setup to terminate these instances, but this is the same risk with or without the Agent restart behavior though.

Testing

Running the static checks within the PR.
make release-agent

New tests cover the changes: no

Description for the changelog

N/A

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@Yiyuanzzz Yiyuanzzz requested a review from a team as a code owner May 9, 2023 17:58
@Yiyuanzzz Yiyuanzzz force-pushed the feature/task-resource-accounting branch 3 times, most recently from f119b09 to 671c580 Compare May 12, 2023 17:22
@Yiyuanzzz Yiyuanzzz force-pushed the feature/task-resource-accounting branch from 671c580 to 8a7f922 Compare May 12, 2023 18:27
@Yiyuanzzz Yiyuanzzz force-pushed the feature/task-resource-accounting branch from 8a7f922 to 69e7dfe Compare May 12, 2023 19:28
Copy link
Contributor

@yinyic yinyic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Yiyuan! The PR looks good overall.

@@ -31,14 +32,15 @@ import (
func NewTaskEngine(cfg *config.Config, client dockerapi.DockerClient,
credentialsManager credentials.Manager,
containerChangeEventStream *eventstream.EventStream,
imageManager ImageManager, state dockerstate.TaskEngineState,
imageManager ImageManager, hostResources map[string]*ecs.Resource, state dockerstate.TaskEngineState,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we initiate hostResourceManager outside of TaskEngine, and pass it to the engine constructor? TaskEngine needs a hostResourceManager object, but it doesn't (or shouldn't) need to know how it's created.

Similarly, task engine also consumes the other managers such as credentialsManager and imageManager directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per offline discuss, will keep it as it is for now

hostResource map[string]*ecs.Resource
consumedResource map[string]*ecs.Resource

taskConsumed map[string]bool //task.arn to boolean whether host resources consumed or not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit - if we are going to be using this map as a set (i.e. we only ever care about whether the key exists), we can use map[string]struct{} to indicate that the value doesn't matter. But current implementation is fine too.

func NewHostResourceManager(resourceMap map[string]*ecs.Resource) HostResourceManager {
consumedResourceMap := make(map[string]*ecs.Resource)
taskConsumed := make(map[string]bool)
// assigns CPU, MEMORY, PORTS, PORTS_UDP from host
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit - is it possible to make PORTS -> PORTS_TCP just to be clear?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, will change it


// HostResourceManager keeps account of each task in
type HostResourceManager struct {
hostResource map[string]*ecs.Resource
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit - maybe rename to initialHostResource if we plan to keep it immutable?

Also can we add a comment for the variable to clarify that, for resources in resourceMap, some are "available resources" like CPU, mem, while some others are "reserved/consumed resources" like ports, so that readers don't get confused about the inconsistency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Yinyi, will update it!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Yiyuanzzz Would it be more clearer to instead rename hostResource as initialHostResource and keep the struct name as HostResourceManager? As the consumeResource and taskConsumed will keep changing with state?

@@ -306,6 +306,18 @@ func (client *APIECSClient) getResources() ([]*ecs.Resource, error) {
return []*ecs.Resource{&cpuResource, &memResource, &portResource, &udpPortResource}, nil
}

func (client *APIECSClient) GetHostResources() (map[string]*ecs.Resource, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. Could we add comments to this public function? See RegisterContainerInstance as an example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, will update it

}
}

hostResources, err := client.GetHostResources()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we move line 317-321 before line 301-315 to make it fail fast if agent cannot get host resources?

// Get host resources
hostResources, err := client.GetHostResources()
if err != nil {
    seelog.Critical("Unable to fetch host resources")
    return exitcodes.ExitError
}
// Get GPU devices
numGPUs := int64(0)
	if agent.cfg.GPUSupportEnabled {
		err := agent.initializeGPUManager()
		if err != nil {
			seelog.Criticalf("Could not initialize Nvidia GPU Manager: %v", err)
			return exitcodes.ExitError
		}
		// Find number of GPUs instance has
		platformDevices := agent.getPlatformDevices()
		for _, device := range platformDevices {
			if *device.Type == ecs.PlatformDeviceTypeGpu {
				numGPUs++
			}
		}
	}
// Update GPU in hostResources map
hostResources["GPU"] = &ecs.Resource{
    Name:         utils.Strptr("GPU"),
    Type:         utils.Strptr("INTEGER"),
    IntegerValue: &numGPUs,
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good to me!

hostResources, err := client.GetHostResources()
if err != nil {
seelog.Critical("Unable to fetch host resources")
return exitcodes.ExitError
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with @prateekchaudhry offline. We should revisit this error code by referring logic here.

  • ExitError: (as well as unspecified exit codes) indicates a fatal error occurred, but the agent should be restarted
  • ExitTerminal: indicates the agent has exited unsuccessfully, but should not be restarted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed with @yinyic , we probably should use ExitError because it may be fail to read the host resource due to some temporary reasons and a retry will work.

@@ -78,6 +79,20 @@ var apiVersions = []dockerclient.DockerVersion{
dockerclient.Version_1_22,
dockerclient.Version_1_23}
var capabilities []*ecs.Attribute
var testHostCPU = int64(1024)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. Should they be constants instead of variables since we are not going to change them for different test cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we make those constants, then testHostCPU and testHostMEMORY are not addressable, we can not get address of them in var testHostResource = map[string]*ecs.Resource{ .....IntegerValue: &testHostCPU,}

@@ -78,6 +79,20 @@ var apiVersions = []dockerclient.DockerVersion{
dockerclient.Version_1_22,
dockerclient.Version_1_23}
var capabilities []*ecs.Attribute
var testHostCPU = int64(1024)
var testHostMEMORY = int64(1024)
var testHostResource = map[string]*ecs.Resource{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q. would we have any negative test case for "get host resources" in a follow-up PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so

// permissions and limitations under the License.

// Package engine contains the core logic for managing tasks

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. Can we add more comments/contexts for this new manager? See agent/engine/docker_task_engine.go as an example. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Chilinn, will add more comments

//PORTS
//Copying ports from host resources as consumed ports for initializing
ports := []*string{}
if resourceMap != nil && resourceMap["PORTS"] != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q. Could a nil resourceMap ever be passed to here? If yes, should we returned an error instead of creating a new manager? or should we avoid using a nil resourceMap to initialize the manager?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, a nil resourceMap will be passed to here when calling engine.NewTaskEngine(....nil..) in several test files,
for example in agent/statemanager/state_manager_test.go , taskEngine := engine.NewTaskEngine(&config.Config{}, nil, nil, nil, nil, nil, dockerstate.NewTaskEngineState(), nil, nil, nil, nil)
we add this logic to avoid nil pointer exceptions

@Yiyuanzzz Yiyuanzzz force-pushed the feature/task-resource-accounting branch 2 times, most recently from 888d2d9 to a82e500 Compare May 16, 2023 00:21
@Yiyuanzzz Yiyuanzzz merged commit 6caa2c2 into aws:feature/task-resource-accounting May 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants