
engine: don't stop container when applied status is running #1446

Merged
merged 1 commit into aws:dev on Aug 1, 2018

Conversation

haikuoliu
Contributor

@haikuoliu haikuoliu commented Jul 11, 2018

Summary

This PR aims to fix this issue, where a container already marked as stopped comes back with a running status.

Implementation details

Do not mark a container as stopped while its applied status is still RUNNING, i.e. while a previously issued start transition is still being applied.
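
To make the guard concrete, here is a minimal, self-contained Go sketch. It is not the agent's actual code; the names (ContainerStatus, Container, shouldReStop) and the statuses are illustrative stand-ins for the agent's real types.

package main

import "fmt"

// ContainerStatus is a simplified stand-in for the agent's container status values.
type ContainerStatus int

const (
	StatusNone ContainerStatus = iota
	StatusRunning
	StatusStopped
)

// Container is a simplified stand-in that tracks what the agent knows about a
// container, what it wants it to be, and which transition it is currently
// applying (for example a docker start call that has not returned yet).
type Container struct {
	Name          string
	KnownStatus   ContainerStatus
	DesiredStatus ContainerStatus
	AppliedStatus ContainerStatus
}

// shouldReStop reports whether a container that was already marked STOPPED but
// is observed RUNNING again should be stopped right away. The guard described
// above corresponds to the AppliedStatus check: while a RUNNING transition is
// still in flight, issuing another stop races with the start and can leave the
// container running after the task is reported STOPPED, so do not stop yet.
func shouldReStop(c *Container) bool {
	if c.AppliedStatus == StatusRunning {
		return false // a start is still being applied; wait for it to finish
	}
	return c.KnownStatus == StatusStopped && c.DesiredStatus == StatusStopped
}

func main() {
	sleep3 := &Container{
		Name:          "sleep3",
		KnownStatus:   StatusStopped,
		DesiredStatus: StatusStopped,
		AppliedStatus: StatusRunning, // start call still in flight
	}
	fmt.Printf("re-stop %s now? %v\n", sleep3.Name, shouldReStop(sleep3))
}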

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -o amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass

New tests cover the changes: yes
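
For illustration only, a table-driven test over the hypothetical shouldReStop sketch above (these are not the tests added in this PR) could assert that no stop is issued while a RUNNING transition is still in flight:

package main

import "testing"

// TestShouldReStop exercises the hypothetical guard sketched above: do not
// stop while a RUNNING transition is still being applied; re-stop otherwise.
func TestShouldReStop(t *testing.T) {
	cases := []struct {
		name string
		c    Container
		want bool
	}{
		{
			name: "start still in flight: do not re-stop",
			c:    Container{KnownStatus: StatusStopped, DesiredStatus: StatusStopped, AppliedStatus: StatusRunning},
			want: false,
		},
		{
			name: "no transition in flight: re-stop the stopped container that came back",
			c:    Container{KnownStatus: StatusStopped, DesiredStatus: StatusStopped, AppliedStatus: StatusNone},
			want: true,
		},
	}
	for _, tc := range cases {
		if got := shouldReStop(&tc.c); got != tc.want {
			t.Errorf("%s: shouldReStop() = %v, want %v", tc.name, got, tc.want)
		}
	}
}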

Description for the changelog

Bug - Fixed a bug where a container marked as stopped could come back with a running status

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@haikuoliu haikuoliu requested a review from a team July 11, 2018 17:31
@haikuoliu haikuoliu changed the title engine: allow each container to re-stop once [WIP] engine: allow each container to re-stop once Jul 12, 2018
@haikuoliu haikuoliu added this to the 1.19.0 milestone Jul 13, 2018
@haikuoliu haikuoliu changed the title [WIP] engine: allow each container to re-stop once engine: don't stop container when applied status is running Jul 14, 2018
@adnxn adnxn modified the milestones: 1.19.0, 1.20.0 Jul 16, 2018

@richardpen richardpen left a comment


Can you manually reproduce this issue and verify that this change actually fixes it?

@haikuoliu
Contributor Author

@richardpen Yes, I used the following task definition and script:

Task definition:

{
  "containerDefinitions": [
    {
      "name": "sleep1",
      "image": "busybox",
      "cpu": 10,
      "command": [
        "sleep",
        "36000000"
      ],
      "memory": 10,
      "essential": true
    },
    {
      "name": "sleep2",
      "image": "busybox",
      "cpu": 10,
      "command": [
        "sleep",
        "36000000"
      ],
      "memory": 10,
      "essential": true
    },
    {
      "name": "sleep3",
      "image": "busybox",
      "cpu": 10,
      "command": [
        "sleep",
        "36000000"
      ],
      "memory": 10,
      "essential": true
    },
    {
      "name": "sleep4",
      "image": "busybox",
      "cpu": 10,
      "command": [
        "sleep",
        "36000000"
      ],
      "memory": 10,
      "essential": true
    }
  ],
  "family": "sleep_forever"
}

Test script:

aws ecs register-task-definition --cli-input-json file://$path_to_task_def/SleepForever.json
task=$(aws ecs run-task --cluster HaikuoCluster --task-definition sleep_forever:1)
# Stop the task almost immediately, while the container starts are still in flight.
sleep 0.00001
task_arn1=$(echo $task | jq -r '.tasks[0].taskArn')
aws ecs stop-task --cluster HaikuoCluster --task $task_arn1

The first time I ran the above script with the unfixed Agent, I got logs similar to those in the issue: redundant container state changes reporting containers as RUNNING although they were already marked STOPPED, with one container being re-stopped once. Part of the Agent logs:

2018-07-18T21:37:52Z [WARN] Managed task [arn:aws:ecs:us-west-2:12345678:task/1f24b809-39e5-4494-9cde-4ca7452e6a69]: stopped container [sleep3] came back; re-stopping it once
2018-07-18T21:37:52Z [INFO] Managed task [arn:aws:ecs:us-west-2:12345678:task/1f24b809-39e5-4494-9cde-4ca7452e6a69]: redundant container state change. sleep3 to RUNNING, but already STOPPED
2018-07-18T21:37:52Z [INFO] Task engine [arn:aws:ecs:us-west-2:12345678:task/1f24b809-39e5-4494-9cde-4ca7452e6a69]: stopping container [sleep3]
2018-07-18T21:37:52Z [INFO] Managed task [arn:aws:ecs:us-west-2:12345678:task/1f24b809-39e5-4494-9cde-4ca7452e6a69]: Cgroup resource set up for task complete
2018-07-18T21:37:52Z [INFO] Managed task [arn:aws:ecs:us-west-2:12345678:task/1f24b809-39e5-4494-9cde-4ca7452e6a69]: redundant container state change. sleep3 to RUNNING, but already STOPPED
2018-07-18T21:37:52Z [INFO] Managed task [arn:aws:ecs:us-west-2:12345678:task/1f24b809-39e5-4494-9cde-4ca7452e6a69]: Cgroup resource set up for task complete
2018-07-18T21:37:52Z [INFO] Task engine [arn:aws:ecs:us-west-2:12345678:task/1f24b809-39e5-4494-9cde-4ca7452e6a69]: started docker container for task: sleep2 -> 3987035a9f214d5bfa5b48ea4a2fdb482310b5b47ddc976a25c0a26f2688bac2, took 428.703487ms
2018-07-18T21:37:52Z [INFO] Managed task [arn:aws:ecs:us-west-2:12345678:task/1f24b809-39e5-4494-9cde-4ca7452e6a69]: redundant container state change. sleep2 to RUNNING, but already STOPPED
2018-07-18T21:37:52Z [INFO] Managed task [arn:aws:ecs:us-west-2:12345678:task/1f24b809-39e5-4494-9cde-4ca7452e6a69]: Cgroup resource set up for task complete
2018-07-18T21:37:52Z [INFO] Managed task [arn:aws:ecs:us-west-2:12345678:task/1f24b809-39e5-4494-9cde-4ca7452e6a69]: redundant container state change. sleep2 to RUNNING, but already STOPPED
2018-07-18T21:37:52Z [INFO] Managed task [arn:aws:ecs:us-west-2:12345678:task/1f24b809-39e5-4494-9cde-4ca7452e6a69]: Cgroup resource set up for task complete
2018-07-18T21:37:52Z [INFO] Task engine [arn:aws:ecs:us-west-2:12345678:task/1f24b809-39e5-4494-9cde-4ca7452e6a69]: started docker container for task: sleep4 -> 5dadfcbb2a2fe48d8bea2b913203d4df66ed365c3f96309cd37b1c4d4f8db818, took 394.058229ms
2018-07-18T21:37:52Z [INFO] Managed task [arn:aws:ecs:us-west-2:12345678:task/1f24b809-39e5-4494-9cde-4ca7452e6a69]: redundant container state change. sleep4 to RUNNING, but already STOPPED
2018-07-18T21:37:52Z [INFO] Managed task [arn:aws:ecs:us-west-2:12345678:task/1f24b809-39e5-4494-9cde-4ca7452e6a69]: Cgroup resource set up for task complete
2018-07-18T21:37:52Z [INFO] Managed task [arn:aws:ecs:us-west-2:12345678:task/1f24b809-39e5-4494-9cde-4ca7452e6a69]: redundant container state change. sleep4 to RUNNING, but already STOPPED

And when I ran docker container list, I got the following:

CONTAINER ID        IMAGE                            COMMAND                  CREATED             STATUS              PORTS                                 NAMES
85787ae0310b        amazon/amazon-ecs-agent:latest   "/agent"                 2 minutes ago       Up 2 minutes                                              ecs-agent
5dadfcbb2a2f        busybox                          "sleep 36000000"         14 minutes ago      Up 14 minutes                                             ecs-sleep_forever-1-sleep4-cefac39fcc9cf1f61c00
3987035a9f21        busybox                          "sleep 36000000"         14 minutes ago      Up 14 minutes                                             ecs-sleep_forever-1-sleep2-b2d086f0ca8bf8b05d00
90d65e49e811        amazon/amazon-ecs-agent:make     "/agent "                8 days ago          Up 8 days           51679/tcp, 0.0.0.0:32899->51678/tcp   romantic_brown
57117175245b        amazon/amazon-ecs-agent:make     "/agent "                9 days ago          Up 9 days           51679/tcp, 0.0.0.0:32858->51678/tcp   elastic_curran
769f4d6bb483        nginx                            "nginx -g 'daemon of…"   11 days ago         Up 11 days          443/tcp, 127.0.0.1:51671->80/tcp      test-ecs-registry-auth
52b5e007426f        registry                         "/bin/registry /etc/…"   11 days ago         Up 11 days          127.0.0.1:51670->5000/tcp             test-ecs-registry

You can see that two of the task's containers (sleep2 and sleep4) are still running even though the task was stopped.

But when I ran the above script with the fixed Agent multiple times, I could not find any redundant state changes reporting a container as RUNNING when it was already STOPPED, and there were no extra running containers.

Part of the logs:

2018-07-18T21:52:31Z [INFO] TaskHandler: Adding event: TaskChange: [arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5 -> STOPPED, Known Sent: NONE, PullStartedAt: 0001-01-01 00:00:00 +0000 UTC, PullStoppedAt: 0001-01-01 00:00:00 +0000 UTC, ExecutionStoppedAt: 2018-07-18 21:52:31.257390761 +0000 UTC m=+116.605725512, arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5 sleep2 -> STOPPED, Known Sent: NONE, arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5 sleep3 -> STOPPED, Known Sent: NONE, arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5 sleep4 -> STOPPED, Known Sent: NONE, arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5 sleep1 -> STOPPED, Known Sent: NONE] sent: false
2018-07-18T21:52:31Z [INFO] TaskHandler: Sending task change: TaskChange: [arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5 -> STOPPED, Known Sent: NONE, PullStartedAt: 0001-01-01 00:00:00 +0000 UTC, PullStoppedAt: 0001-01-01 00:00:00 +0000 UTC, ExecutionStoppedAt: 2018-07-18 21:52:31.257390761 +0000 UTC m=+116.605725512, arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5 sleep2 -> STOPPED, Known Sent: NONE, arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5 sleep3 -> STOPPED, Known Sent: NONE, arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5 sleep4 -> STOPPED, Known Sent: NONE, arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5 sleep1 -> STOPPED, Known Sent: NONE] sent: false
2018-07-18T21:52:31Z [INFO] Managed task [arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5]: sent task change event [arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5 -> STOPPED, Known Sent: NONE, PullStartedAt: 0001-01-01 00:00:00 +0000 UTC, PullStoppedAt: 0001-01-01 00:00:00 +0000 UTC, ExecutionStoppedAt: 2018-07-18 21:52:31.257390761 +0000 UTC m=+116.605725512]
2018-07-18T21:52:31Z [INFO] Managed task [arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5]: redundant resource state change. cgroup to REMOVED, but already REMOVED
2018-07-18T21:52:31Z [INFO] Managed task [arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5]: redundant container state change. sleep3 to STOPPED, but already STOPPED
2018-07-18T21:52:31Z [INFO] Managed task [arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5]: redundant resource state change. cgroup to REMOVED, but already REMOVED
2018-07-18T21:52:31Z [INFO] Managed task [arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5]: redundant resource state change. cgroup to REMOVED, but already REMOVED
2018-07-18T21:52:31Z [INFO] Managed task [arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5]: redundant container state change. sleep4 to STOPPED, but already STOPPED
2018-07-18T21:52:31Z [INFO] Managed task [arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5]: redundant container state change. sleep1 to STOPPED, but already STOPPED
2018-07-18T21:52:31Z [INFO] Managed task [arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5]: redundant container state change. sleep4 to STOPPED, but already STOPPED
2018-07-18T21:52:31Z [INFO] Managed task [arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5]: redundant container state change. sleep1 to STOPPED, but already STOPPED
2018-07-18T21:52:31Z [INFO] Managed task [arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5]: redundant container state change. sleep1 to STOPPED, but already STOPPED
2018-07-18T21:52:31Z [INFO] Managed task [arn:aws:ecs:us-west-2:022701931793:task/d0a46c65-95fe-4867-9aa3-ea09a55c57b5]: redundant container state change. sleep1 to STOPPED, but already STOPPED

You can see all the containers are stopped.

@richardpen

richardpen commented Jul 24, 2018

@haikuoliu Great, can you resolve the conflicts?

@haikuoliu
Contributor Author

@richardpen Conflicts are resolved now.

@haikuoliu haikuoliu merged commit c052b18 into aws:dev Aug 1, 2018