
Fix bug where pause container was not always cleaned up #2824

Merged — 1 commit merged on Mar 2, 2021

Conversation

@angelcar (Contributor) commented Mar 2, 2021

Summary

This fixes a bug that can prevent the pause container network from being cleaned up when the ECS Agent is restarted due to a dockerd restart. Failing to clean up the pause container network can cause issues for future tasks trying to start.

When dockerd is restarted, all running containers (including the ECS Agent) are stopped. After dockerd and the ECS Agent come back up, the agent proceeds to stop all the tasks that were running prior to the restart, since it observes that their containers are no longer running. In this path, the pause container network is not cleaned up, because the workflow differs from normal operation.

Implementation details

After a managed task stops, check whether the pause container still needs to be cleaned up. Most of the time the pause container has already been cleaned up by the time the managed task stops, but as mentioned above, there are scenarios where this does not happen.

Since container cleanup is now invoked in more than one place (previously it happened only in the stopContainer method), the operation was made idempotent by storing a flag on the container indicating whether that container has been torn down.
For now the flag is used only for the pause container, but it may be useful for other containers that need explicit teardown in the future.
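The idempotent-teardown idea above can be sketched as follows. This is a minimal illustration, not the agent's actual code: the `Container` struct, `TearDownNetwork` function, and the `networkTornDown` field are all hypothetical names standing in for the flag described in the PR.

```go
// Hypothetical sketch of an idempotent teardown guarded by a per-container
// flag, as described in the implementation details above.
package main

import (
	"fmt"
	"sync"
)

// Container models only the fields needed for this sketch.
type Container struct {
	mu              sync.Mutex
	Name            string
	networkTornDown bool // set once teardown has run
}

// TearDownNetwork is safe to call from multiple code paths: the flag
// makes the second and later calls no-ops. It reports whether cleanup
// was actually performed on this call.
func TearDownNetwork(c *Container) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.networkTornDown {
		return false // already cleaned up; nothing to do
	}
	// ...actual CNI network cleanup would happen here...
	c.networkTornDown = true
	return true
}

func main() {
	pause := &Container{Name: "~internal~ecs~pause"}
	fmt.Println(TearDownNetwork(pause)) // true: performed cleanup
	fmt.Println(TearDownNetwork(pause)) // false: no-op on repeat call
}
```

Because the flag lives on the container, any code path (stopContainer or the new post-stop check) can invoke teardown without coordinating with the others.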

Testing

There are a number of unit and integration tests that check for proper pause container cleanup. Those tests should still pass, since the previous logic is unmodified; a few new assertions were added to check that the new teardown flag is set.

In addition, I manually verified that the pause container network is cleaned up under normal circumstances as well as after dockerd restarts.

New tests cover the changes: yes

Description for the changelog

Fix bug where pause container was not always cleaned up

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@angelcar angelcar force-pushed the angelcar_pause_container_cleanup_fix branch 4 times, most recently from 8f97f8b to 23db43c Compare March 2, 2021 18:34
// checkTearDownPauseContainer idempotently tears down the pause container network when the pause
// container's known or desired status is stopped.
func (engine *DockerTaskEngine) checkTearDownPauseContainer(task *apitask.Task) {
for _, container := range task.Containers {
@sharanyad (Contributor) commented Mar 2, 2021:
Instead of iterating through all containers, can we check whether the task is not awsvpc-enabled (GetENI == nil) and return quickly?

@angelcar (Contributor, Author) replied:
Is it time or readability that would be improved? In terms of time, there can be at most 10 containers, so we would gain a few nanoseconds :p

@sharanyad (Contributor) replied:
as a practice, just like to return quickly wherever possible :) also, containers count could be increased in the future

@angelcar (Contributor, Author) replied:
Fair enough. changed.
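The early return the reviewer suggested can be sketched like this. All names here are illustrative: `GetPrimaryENI` stands in for the `GetENI` accessor mentioned in the thread, and the `Task` struct is pared down to what the sketch needs, not the agent's real type.

```go
// Hypothetical sketch of the reviewer's suggestion: return quickly for
// tasks without an ENI (i.e. not using awsvpc networking) instead of
// scanning every container.
package main

import "fmt"

type ENI struct{ ID string }

// Task models only the fields needed for this sketch.
type Task struct {
	Arn        string
	eni        *ENI
	Containers []string
}

// GetPrimaryENI mirrors the idea of a GetENI accessor: a nil result
// means the task does not use awsvpc networking.
func (t *Task) GetPrimaryENI() *ENI { return t.eni }

func checkTearDownPauseContainer(task *Task) string {
	// Quick return: no ENI means no pause container network to tear down.
	if task.GetPrimaryENI() == nil {
		return "skipped: not awsvpc"
	}
	// Otherwise, iterate containers looking for the pause container...
	return fmt.Sprintf("scanned %d containers", len(task.Containers))
}

func main() {
	bridgeTask := &Task{Arn: "task-1"}
	awsvpcTask := &Task{Arn: "task-2", eni: &ENI{ID: "eni-123"},
		Containers: []string{"app", "pause"}}
	fmt.Println(checkTearDownPauseContainer(bridgeTask)) // skipped: not awsvpc
	fmt.Println(checkTearDownPauseContainer(awsvpcTask)) // scanned 2 containers
}
```

As the discussion notes, the win is less about nanoseconds and more about readability and resilience if the per-task container count grows.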

@angelcar angelcar force-pushed the angelcar_pause_container_cleanup_fix branch from 23db43c to 4229491 Compare March 2, 2021 19:03