Update readme for unstable reserved memory value reported when ECS_POLL_METRICS is enabled and ECS_POLLING_METRICS_WAIT_DURATION is set to a high value #3863

Closed

@Realmonia Realmonia commented Aug 22, 2023

Summary

Update readme for unstable reserved memory value reported when ECS_POLL_METRICS is enabled and ECS_POLLING_METRICS_WAIT_DURATION is set to a high value

Background: During an experiment (instance type m5.4xlarge, latest ECS-optimized AMI), AL2023 showed no issue with ECS_POLLING_METRICS_WAIT_DURATION = 20s, while on AL2 the reserved memory value fluctuated ("vibrated") when ECS_POLLING_METRICS_WAIT_DURATION was 19s or 20s. The issue only appears when a large number of tasks are running on the instance (500 tasks and 950 containers in the experiment); with lower task/container counts, reserved memory is not affected. This is likely due to Docker metrics response latency under overload, which causes metrics to miss the reporting window (20s).
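To make the suspected mechanism concrete, here is a minimal sketch of a sample missing the 20s reporting window. It is a simplified, hypothetical model and not the agent's actual code: the function name, the poll offsets, and the latency numbers are all assumptions chosen for illustration.

```python
def windows_without_sample(poll_interval, latencies, window=20.0, first_poll=0.0):
    """Return the indices of reporting windows that received no metrics sample.

    Hypothetical model, not the agent's code: poll i is issued at
    first_poll + i * poll_interval and its response arrives latencies[i]
    seconds later. Window w covers [w * window, (w + 1) * window).
    """
    arrivals = [first_poll + i * poll_interval + lat
                for i, lat in enumerate(latencies)]
    covered = {int(t // window) for t in arrivals}
    # Only judge windows that closed before the last poll was issued.
    last_poll = first_poll + (len(latencies) - 1) * poll_interval
    complete_windows = int(last_poll // window)
    return [w for w in range(complete_windows) if w not in covered]


# Latency alternates between 1s and a 6s spike (made-up numbers).
spiky = [1, 6, 1, 6, 1, 6, 1, 6]

print(windows_without_sample(20, spiky[:4], first_poll=15))  # 20s wait duration
print(windows_without_sample(10, spiky, first_poll=5))       # 10s default
```

Under these assumptions, the 20s wait leaves window 1 without any sample, while the 10s default still lands at least one sample in every window, because two polls fall inside each window and a single slow response cannot empty it.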

Given that the behavior differs across compute types and workloads, and that we already have a solid default value (10s), I decided not to change the agent's handling code. Instead, this PR updates the README to caution users about this issue under these specific circumstances.

Implementation details

Testing

New tests cover the changes: N/A

Description for the changelog

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@Realmonia Realmonia marked this pull request as ready for review August 22, 2023 01:58
@Realmonia Realmonia requested a review from a team as a code owner August 22, 2023 01:58
danehlim previously approved these changes Aug 23, 2023
@prateekchaudhry
Contributor

> where reserved memory of ECS cluster becomes unstable due to missing metrics sample at metric collection time

I am a little unclear here: does this relate to ECS_RESERVED_MEMORY/config.ReservedMemory? If that is a config, how does it fluctuate with the polling metrics duration?

@Realmonia
Contributor Author

> > where reserved memory of ECS cluster becomes unstable due to missing metrics sample at metric collection time
>
> I am a little unclear here: does this relate to ECS_RESERVED_MEMORY/config.ReservedMemory? If that is a config, how does it fluctuate with the polling metrics duration?

No, it's a different setting: the memory limit of the task/container in the task definition. This value is collected in metrics to show the theoretical limit of utilization.

@prateekchaudhry
Contributor

prateekchaudhry commented Aug 25, 2023

I see. Is it the task-level memory limit, and the total memory reserved using it, or something similar? In that case, I wonder if this could be rephrased to distinguish it from the 'other' Reserved Memory (non-blocking).

> of ECS cluster

In hindsight, this does make it clearer.

@Realmonia
Contributor Author

> I see. Is it the task-level memory limit, and the total memory reserved using it, or something similar? In that case, I wonder if this could be rephrased to distinguish it from the 'other' Reserved Memory (non-blocking).
>
> > of ECS cluster
>
> In hindsight, this does make it clearer.

Total memory does not use it. ECS_RESERVED_MEMORY is a value we subtract when calculating total available memory during the RCI (RegisterContainerInstance) call; it is the memory projected to be used by agent-managed processes. The reserved memory discussed here is for scaling purposes: if a task has reserved memory, the ECS scheduling procedure guarantees that the task can use that amount of memory when needed.
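A rough sketch of the distinction, for readers who hit the same confusion. Both function names and all numbers are hypothetical, and the way the metric is aggregated over sampled tasks is an assumption made for illustration, not a description of the agent's code.

```python
def registered_memory_mib(host_total_mib, ecs_reserved_memory_mib):
    """Memory advertised at registration: host total minus ECS_RESERVED_MEMORY."""
    return host_total_mib - ecs_reserved_memory_mib


def reported_memory_reservation_mib(task_limits_mib, sampled_task_ids):
    """Sum of task-level memory limits over tasks that produced a sample.

    Assumption: a task whose stats did not arrive in time is left out
    of the reported value for that window.
    """
    return sum(limit for task_id, limit in task_limits_mib.items()
               if task_id in sampled_task_ids)


# Made-up task limits from task definitions, in MiB.
limits = {"a": 512, "b": 1024, "c": 256}

print(registered_memory_mib(65536, 256))                          # fixed config math
print(reported_memory_reservation_mib(limits, {"a", "b", "c"}))   # all tasks sampled
print(reported_memory_reservation_mib(limits, {"a", "c"}))        # task "b" missed
```

The first value depends only on configuration, so it cannot move with the polling duration. The second value depends on which tasks were sampled, which is why it can fluctuate when samples are late.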

@Realmonia
Contributor Author

I will change all "memory reserved" to "memory reservation value in metrics" to avoid confusion. That SGTY?

@Realmonia
Contributor Author

Not able to trigger gpu integ tests. Force pushed.
