
Fix an edge case that causes container dependency deadlock #2734

Merged (1 commit) on Nov 23, 2020

Conversation

fenxiong (Contributor)

Summary

Fix an edge case that can cause a container dependency deadlock and leave the task stuck in PENDING.

Scenario

Two containers, A and B: container A is a non-essential container, container B has a dependency on container A, and we fail to pull container A's image.

Workflow leading to deadlock

  1. We start pulling container A's image;
  2. The pull of container A's image fails;
  3. We try to create container A anyway (this is intentional, since we want to fall back to a cached image if one exists);
  4. Creating container A fails because the image is missing;
  5. We try to stop container A;
  6. We can't stop container A: because container B depends on container A, there is an implicit reversed dependency on shutdown, so container A won't stop before container B stops;
  7. However, we can't move container B forward either, because it's waiting on container A;
  8. Since container A is not an essential container, an error on A doesn't mark the task as stopped; we just keep trying to transition A and B, and neither can make progress. This ends up in a deadlock, and the task stays stuck in PENDING.

Options to fix

At a high level, I see three ways to address this issue:

  1. In step 6 above, go ahead and stop container A to avoid the deadlock, instead of waiting for a shutdown dependency that will never be fulfilled;
  2. In step 7 above, fail container B's dependency check to avoid the deadlock;
  3. In step 5, directly mark A's known status as STOPPED since we never started it, instead of going through a transition, which involves a dependency check.

For option 1, it's somewhat tricky because the shutdown dependency is not modeled in the container's dependsOn field. Instead, it is implicit and checked separately by

func verifyShutdownOrder(target *apicontainer.Container, existingContainers map[string]*apicontainer.Container) error

This makes implementing option 1 harder: it's hard to run a generic circular-dependency check without modeling the shutdown dependency, and modeling it would require a much larger change. A simplified sketch of the shutdown-order check follows.
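To make the implicit reverse dependency concrete, here is a self-contained sketch of the kind of check verifyShutdownOrder performs. The package, types, fields, and status values below are stand-ins invented for illustration; they are not the agent's actual apicontainer types.

package dependencygraph_sketch

import "fmt"

// Stand-in types for illustration only.
type Status int

const (
    StatusNone Status = iota
    StatusPulled
    StatusCreated
    StatusRunning
    StatusStopped
)

type DependsOn struct {
    ContainerName string
    Condition     string
}

type Container struct {
    Name          string
    DependsOn     []DependsOn
    KnownStatus   Status
    DesiredStatus Status
    AppliedStatus Status
}

// verifyShutdownOrder sketches the implicit reverse dependency: a container
// may not stop while some other container that depends on it has not stopped.
func verifyShutdownOrder(target *Container, existing map[string]*Container) error {
    for _, other := range existing {
        for _, dep := range other.DependsOn {
            if dep.ContainerName == target.Name && other.KnownStatus != StatusStopped {
                return fmt.Errorf("container %s cannot stop before %s stops", target.Name, other.Name)
            }
        }
    }
    return nil
}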

For option 3, the impact is much larger because it is not limited to tasks that use container dependencies. I'm not fully certain what other consequences there would be if we skip a transition that we used to make.

Therefore, option 2 is chosen.

Actual fix

As mentioned above, option 2 is chosen. Taking the above scenario as an example, the fix is to return an error from container B's dependency-resolution check when it sees that container A has not started and will never start.
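A minimal sketch of that guard, reusing the stand-in types from the sketch above; the function name and error text here are hypothetical, and HasNotAndWillNotStart is sketched under Implementation details below. The idea is to fail fast while resolving a container's dependsOn entries instead of waiting forever on a dependency that can no longer start.

// verifyOrderingDependenciesResolvable returns an error if any ordering
// dependency of target has not started and will never start, so the task
// engine can stop the dependent container instead of deadlocking.
func verifyOrderingDependenciesResolvable(target *Container, containers map[string]*Container) error {
    for _, dep := range target.DependsOn {
        depContainer, ok := containers[dep.ContainerName]
        if !ok {
            return fmt.Errorf("ordering dependency %q of container %q not found in the task",
                dep.ContainerName, target.Name)
        }
        if depContainer.HasNotAndWillNotStart() {
            return fmt.Errorf("ordering dependency %q of container %q has not started and will never start",
                dep.ContainerName, target.Name)
        }
    }
    return nil
}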

Implementation details

Add a helper function on the container that indicates whether it is in a state where it has not started and will not start in the future. In the dependency check, return an error if the dependency container has entered such a state.
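A minimal sketch of that helper, still using the stand-in types from the sketches above and following the three conditions quoted in the reviewed code excerpt further down. The real implementation reads the container's unsafe status fields under a read lock, per the review comment below.

// HasNotAndWillNotStart reports whether the container has not started and will
// not start: its known status is earlier than RUNNING, its desired status is
// STOPPED, and it is not in the middle of a transition (applied status is none).
func (c *Container) HasNotAndWillNotStart() bool {
    return c.KnownStatus < StatusRunning &&
        c.DesiredStatus == StatusStopped &&
        c.AppliedStatus == StatusNone
}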

Testing

Added a unit test. Manually tested with the following task definition, which triggers the scenario described in the summary:

{
    "family": "test-nonessential-pull-success-td",
    "executionRoleArn": "arn:aws:iam::xxx:role/ecsTaskExecutionRole",
    "containerDefinitions": [
        {
            "name": "container_1",
            "image": "xxx.dkr.ecr.us-west-2.amazonaws.com/busybox:xxx",
            "essential": false,
            "memory": 256
        },
        {
            "name": "container_2",
            "image": "xxx.dkr.ecr.us-west-2.amazonaws.com/busybox:latest",
            "memory": 256,
            "command": ["sh", "-c", "echo hello; sleep infinity"],
            "dependsOn": [
                {
                    "containerName": "container_1",
                    "condition": "SUCCESS"
                }
            ]
        }
    ]
}

(container_1 intentionally has an invalid image to fail the pull)
Before the change, the above task definition gets stuck in PENDING. With the change, the task stops with "Task failed to start", and the image pull failure can be seen in container_1's status.

I will add a functional test separately.

New tests cover the changes: yes

Description for the changelog

Bug - Fix an edge case that can cause container dependency deadlock

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Review thread on the new helper (excerpt from the diff):

// 1. Container's known status is earlier than running;
// 2. Container's desired status is stopped;
// 3. Container is not in the middle of a transition (indicated by the applied status being the none status).
func (c *Container) HasNotAndWillNotStart() bool {
@angelcar (Contributor) commented on Nov 18, 2020:

I think we should use the unsafe versions of the variables and add a read lock here, since as of now this comparison is not atomic.

@fenxiong (Contributor, Author) replied:

Makes sense. Updated.

@fierlion (Member) left a comment:

good find.

@fenxiong added this to the 1.48.1 milestone on Nov 18, 2020