Introduce jitter for task cleanup wait duration #2969

Merged: fenxiong merged 1 commit into aws:dev on Aug 4, 2021

Conversation

fenxiong
Contributor

@fenxiong fenxiong commented Jul 30, 2021

Summary

Closes #2968.

Introduce jitter for the task cleanup wait duration, configurable via the ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION_JITTER environment variable. This addresses the use case where a large number of tasks are stopped at the same time. Without jitter, the cleanup for all of those tasks also happens at roughly the same time, which can generate a lot of work that could impact the tasks running at the time of the cleanup. With jitter, each task is cleaned up at a different time, avoiding that impact.

Implementation details

Added a new config field for the jitter. The task manager uses it to calculate the cleanup wait duration with jitter.

The default value of the jitter is a zero duration, so existing behavior is unchanged when the new environment variable is not set.

Testing

Added unit test. Manually tested successfully with the following:
(1) ecs.config set with jitter:

ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=1m
ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION_JITTER=2m

Launched 10 tasks that all stop at the same time. Verified that the tasks are cleaned up at different random times between 1m and 3m after they stopped:

cat /var/log/ecs/ecs-agent.log | grep "Cleaning up"
level=info time=2021-07-30T20:46:22Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"
level=info time=2021-07-30T20:46:31Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"
level=info time=2021-07-30T20:46:49Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"
level=info time=2021-07-30T20:46:50Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"
level=info time=2021-07-30T20:46:53Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"
level=info time=2021-07-30T20:47:46Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"
level=info time=2021-07-30T20:47:48Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"
level=info time=2021-07-30T20:47:56Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"
level=info time=2021-07-30T20:48:16Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"
level=info time=2021-07-30T20:48:19Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"

(2) ecs.config set without jitter:

ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION=1m

Then ran the same test again. The tasks are all cleaned up at roughly the same time:

level=info time=2021-07-30T20:51:03Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"
level=info time=2021-07-30T20:51:03Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"
level=info time=2021-07-30T20:51:03Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"
level=info time=2021-07-30T20:51:03Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"
level=info time=2021-07-30T20:51:03Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"
level=info time=2021-07-30T20:51:03Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"
level=info time=2021-07-30T20:51:03Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"
level=info time=2021-07-30T20:51:03Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"
level=info time=2021-07-30T20:51:03Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"
level=info time=2021-07-30T20:51:04Z msg="Cleaning up task's containers and data" taskARN="arn:aws:ecs:us-west-2:xx:task/test-jitter/xx"

New tests cover the changes: yes

Description for the changelog

Enhancement - Introduce optional jitter for task cleanup wait duration, configurable via the ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION_JITTER environment variable. When a large number of tasks are stopped at the same time, specifying this jitter helps avoid all of the task cleanups happening at once, which could add pressure to the instance and, as a result, affect running tasks.

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@fenxiong fenxiong marked this pull request as ready for review July 30, 2021 22:37
@fenxiong fenxiong requested a review from a team July 30, 2021 22:37
@fenxiong fenxiong merged commit 5088fc8 into aws:dev Aug 4, 2021
@fenxiong fenxiong deleted the wait-jitter branch August 4, 2021 18:38
@fenxiong fenxiong mentioned this pull request Aug 5, 2021
@fenxiong fenxiong added this to the 1.55.0 milestone Aug 10, 2021
4 participants