Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes a CPUUtilization metrics calculation bug for Windows clusters. #1219

Closed
wants to merge 1 commit into from

Conversation

bboerst
Copy link
Contributor

@bboerst bboerst commented Jan 29, 2018

Summary

This PR fixes a calculation error witnessed while collecting CPUUsage.TotalUsage on Windows. The existing calculation resulted in an unusable value being sent back to AWS as 'CPUUtilization' and was impossible to derive scaling policies off of it. Converting this value to a percentage during stat collection now results in accurate metrics.

Implementation details

cpuUsage is converted to a percentage (* 100) then divided by numCores inside of the dockerStatsToContainerStats function.

Testing

This change was compiled and tested on a Windows ECS cluster running a number of different instance type configurations comprising of multiple vCPU counts in order to validate the new calculation.

  • Builds on Linux (make release)
  • Builds on Windows (go build -out amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass

New tests cover the changes: no

Description for the changelog

  • Bug - Fixed a CPUUtilization metrics calculation bug for Windows clusters.

Licensing

This contribution is under the terms of the Apache 2.0 License: yes

@samuelkarp
Copy link
Contributor

Hey, thanks for opening this PR.

This PR fixes a calculation error witnessed while collecting CPUUsage.TotalUsage on Windows. The existing calculation resulted in an unusable value being sent back to AWS as 'CPUUtilization' and was impossible to derive scaling policies off of it.

Can you describe the bug and how you were running into it? Were you seeing issues with cluster metrics or with service metrics?

@bboerst
Copy link
Contributor Author

bboerst commented Jan 29, 2018

Hi @samuelkarp.
We noticed both cluster metrics and service metrics being impacted by this bug. Here's a snapshot of the cloudwatch metrics from running 1.16.1 on our cluster, with annotations:

https://imgur.com/dGtj3Ft

You can see how the values don't make any sense.

Here's a snapshot of the 'service' after applying the change in this PR which is now accurately showing the percentage.

https://imgur.com/ei0jAHL

This snapshot properly shows the CPU bursting and various autoscalings happening at > 80% CPUUtilization.

@bboerst
Copy link
Contributor Author

bboerst commented Jan 29, 2018

Another snapshot demonstrating both the cluster and the service with incorrect metrics on 1.16.1:

https://imgur.com/ffQwVkh

...and one more after the fix in this PR illustrating correctly reporting metrics for cluster and service using autoscaling on a cluster of r4.2xlarge instances.

https://imgur.com/GJG9mFg

@richardpen
Copy link

@bboerst I verified this is actually a bug for windows, thanks for the fix. Can you confirm that this contribution is under the terms of the Apache 2.0 license?

Thanks,
Peng

@bboerst
Copy link
Contributor Author

bboerst commented Jan 29, 2018

@richardpen Confirmed. This contribution is submitted by me under the terms of the Apache 2.0 license.

@bboerst bboerst force-pushed the fix-windows-cpuutilization-metric branch from b402122 to b6ac749 Compare January 30, 2018 23:56
@richardpen
Copy link

@bboerst Thanks for your contributions, I have picked your commit and added some tests in #1224.

Thanks,
Peng

@richardpen richardpen closed this Jan 31, 2018
@richardpen richardpen reopened this Jan 31, 2018
@richardpen richardpen added this to the 1.17.0 milestone Jan 31, 2018
@richardpen
Copy link

#1224 has been merged, rebasing isn't automatically closing this one. So manually closing this one.

@richardpen richardpen closed this Feb 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants