Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix an edge case for container ordering success condition #2404

Merged
merged 1 commit into from
Mar 24, 2020

Conversation

fenxiong
Copy link
Contributor

@fenxiong fenxiong commented Mar 20, 2020

Summary

Fix an edge case for container ordering success condition.

Currently, if a container being depended on success condition has a non-empty exit code while its status is running, the task is marked as failed. However, under sufficient load (100+ containers), when running a container that exits quickly, Docker can send container running event along with the exit code, so in that case we will store the exit code of the container while its status is still RUNNING. This causes the agent to fail the task when such container is depended on by others with success condition:

...
level=debug time=2020-03-20T18:05:41Z msg="Managed task [arn:aws:ecs:us-west-2:xxx:task/test-efs-order/xxx]: can't apply state to container [reader] yet due to unresolved dependencies: dependency graph: failed to resolve container ordering dependency [writer(busybox) (RUNNING->RUNNING) - Exit: 0] for target [reader(busybox) (NONE->RUNNING)] as dependency did not exit successfully." module=task_manager.go
level=critical time=2020-03-20T18:05:41Z msg="Managed task [arn:aws:ecs:us-west-2:xxx:task/test-efs-order/xxx]: task in a bad state; it's not steadystate but no containers want to transition" module=task_manager.go

Fixing this case by not failing the success condition check unless the dependent container is both stopped and has a non-successful exit code.

Implementation details

Adjust logic in dependency graph to reflect changes described above.

Testing

Added a test case in unit test. Manually reproduced the issue with around 100 tasks that has a fast exiting container and a container depends on it with success condition, and verified that this change fixes the issue.

New tests cover the changes: yes

Description for the changelog

Fix an edge case that can cause task failed to start when using container ordering success condition.

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

The container running event from Docker can contain exit code of the container. Don't fail the task if a container being dependent on success has a non empty exit code with running status.
@fenxiong fenxiong requested a review from a team March 20, 2020 23:51
@fenxiong fenxiong added this to the 1.39.0 milestone Mar 23, 2020
@fenxiong fenxiong merged commit 8f4902d into aws:dev Mar 24, 2020
@fenxiong fenxiong mentioned this pull request Apr 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants