State not saving correctly on agent termination, resulting in incorrect task states #2618

Closed
tom-pryor opened this issue Sep 8, 2020 · 2 comments

Comments

@tom-pryor

Summary

When the agent terminates, the final state isn't saved due to a timeout. After a restart, the state of the tasks is inconsistent with the state of the actual containers.

Description

We first noticed a problem when we had tasks marked as RUNNING in the ECS control plane, but on inspection the container corresponding to each task was stopped. This causes serious problems: whole services can fail because the ECS control plane reports tasks as RUNNING while no containers are actually running.

Upon termination the ecs-agent outputs a critical error message:

level=info time=2020-09-07T18:42:02Z msg="Agent received termination signal: terminated" module=termination_handler.go
level=critical time=2020-09-07T18:42:05Z msg="Error saving state before final shutdown: Multiple error:\n\t0: final save: timed out trying to save to disk" module=termination_handler.go

Upon restart it outputs the same warning for many tasks:

level=info time=2020-09-07T18:43:48Z msg="Restored from checkpoint file. I am running as 'arn:aws:ecs:eu-west-1:xxx:container-instance/xxx' in cluster 'xxx'" module=agent.go
level=info time=2020-09-07T18:43:48Z msg="Remaining mem: 7624" module=client.go
level=info time=2020-09-07T18:43:48Z msg="Registered container instance with cluster!" module=client.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/25b829d7-091d-480d-8fa4-34c990398464]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/45559759-07a9-4829-a3c0-a4da8450c31a]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/95c72f1c-5934-4c40-b93b-2e6fa143f028]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/cce29eba-9e2c-46ea-94c5-552372887bca]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/cd0981db-0bd0-49c9-b15a-9e5e24ae46f8]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/f9c59e59-dae5-458c-93ef-f360b6dbcb93]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/0b6b3570-cdcd-478a-83b6-daee928cb7fc]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/48804e72-1443-46f2-9141-3ef6eded84c3]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/5934c040-ea83-4755-91f0-9508d7267d54]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go

The ECS agent then logs the same messages, seemingly endlessly, about the tasks that are stuck in RUNNING:

Sept 08 10:21:38 2857d08d7e52 ecs-agent level=info time=2020-09-08T09:21:38Z msg="Managed task [arn:aws:ecs:eu-west-1:xxx:task/ce7355bb-0e8a-404e-8378-fdea7bf9001f]: Container [name=xxx runtimeID=61dfe4ce7a415e1c65bd6744fd998a351248d980d25554af8fcc46c3a8da2a4d]: handling container change event [NONE]" module=task_manager.go
Sept 08 10:21:38 2857d08d7e52 ecs-agent level=info time=2020-09-08T09:21:38Z msg="Managed task [arn:aws:ecs:eu-west-1:xxx:task/ce7355bb-0e8a-404e-8378-fdea7bf9001f]: Container [name=xxx runtimeID=61dfe4ce7a415e1c65bd6744fd998a351248d980d25554af8fcc46c3a8da2a4d]: container change RUNNING->NONE is redundant" module=task_manager.go

(N.B. in the example above, container ID 61dfe4ce7a415e1c65bd6744fd998a351248d980d25554af8fcc46c3a8da2a4d is stopped.)

The only solution is either to stop the tasks via the ECS API/Console or to terminate the EC2 instance running the impacted agent. Note that draining the instance before the agent is restarted would presumably also avoid the problem.
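
For reference, a minimal sketch of the stop-the-tasks workaround using the AWS SDK for Go (v1); the cluster name and container-instance ARN below are placeholders, and ListTasks pagination is omitted for brevity:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

func main() {
	// Placeholder values; substitute your own cluster and container-instance ARN.
	cluster := "my-cluster"
	instanceArn := "arn:aws:ecs:eu-west-1:123456789012:container-instance/xxx"

	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
	svc := ecs.New(sess)

	// List the tasks the control plane still believes are RUNNING on the affected instance.
	// (A single page is fetched here; handle NextToken for instances with many tasks.)
	out, err := svc.ListTasks(&ecs.ListTasksInput{
		Cluster:           aws.String(cluster),
		ContainerInstance: aws.String(instanceArn),
		DesiredStatus:     aws.String("RUNNING"),
	})
	if err != nil {
		log.Fatal(err)
	}

	// Stop each task so the scheduler replaces it with a healthy one.
	for _, taskArn := range out.TaskArns {
		_, err := svc.StopTask(&ecs.StopTaskInput{
			Cluster: aws.String(cluster),
			Task:    taskArn,
			Reason:  aws.String("container already stopped; agent state inconsistent"),
		})
		if err != nil {
			log.Printf("failed to stop %s: %v", aws.StringValue(taskArn), err)
			continue
		}
		log.Printf("stopped %s", aws.StringValue(taskArn))
	}
}
```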

Perhaps this is related to the new boltdb storage for state? To me this suggests other problems as well, beyond the restart case: what happens if the agent crashes, or there is some other unexpected restart of the EC2 instance or Docker? Will the state be inconsistent then?

Expected Behavior

The agent correctly saves the state before shutting down and correctly remembers the state after restarting. Tasks shouldn't be marked as RUNNING when their container isn't.

Observed Behavior

The agent doesn't correctly save its state on shutdown and restarts in an inconsistent state. Tasks are left in state RUNNING, but their containers aren't actually running.

Environment Details

Containers: 430
 Running: 18
 Paused: 0
 Stopped: 412
Images: 33
Server Version: 18.06.3-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: a592beb5bc4c4092b1b1bac971afed27687340c5
init version: fec3683b971d9c3ef73f284f176672c44b448662
Security Options:
 seccomp
  Profile: default
 selinux
Kernel Version: 4.19.143-flatcar
Operating System: Flatcar Container Linux by Kinvolk 2512.4.0 (Oklo)
OSType: linux
Architecture: x86_64
xxx
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
{"Cluster":"xxx","ContainerInstanceArn":"arn:aws:ecs:eu-west-1:xxx:container-instance/xxx","Version":"Amazon ECS Agent - v1.44.3 (33b647b8)"}

Supporting Log Snippets

See above for snippets. Happy to be contacted for the tarball from the log collector.

@fenxiong (Contributor) commented Sep 8, 2020

Hi,
Sorry to hear that you had this issue. Currently, due to a bug introduced when switching to boltDB, the agent relies on the termination signal handler to store the container ID and name, so if the agent fails to save the state at termination, it loses track of the containers. We have fixed the bug with #2608 and #2609, so that the agent saves the container ID and name when creating the container; a single save failure then affects only one container instead of all containers (which matches the behavior of previous agent versions that don't use boltDB). The fixes will go out in our next release. The workaround before the next release is to downgrade the agent to 1.43.0 or earlier.
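
For illustration only, a minimal sketch of the save-on-create idea described above, using the bbolt library; this is not the actual agent code, and the bucket, record, and function names are made up:

```go
package main

import (
	"encoding/json"
	"log"
	"time"

	bolt "go.etcd.io/bbolt"
)

// containerRecord is a made-up, simplified stand-in for the agent's
// per-container state; the real agent stores much richer task/container data.
type containerRecord struct {
	DockerID string `json:"dockerId"`
	Name     string `json:"name"`
}

// saveContainer persists one container's ID and name at creation time, so a
// failed save affects only that container rather than the whole checkpoint
// written by the termination handler.
func saveContainer(db *bolt.DB, rec containerRecord) error {
	data, err := json.Marshal(rec)
	if err != nil {
		return err
	}
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("containers"))
		if err != nil {
			return err
		}
		return b.Put([]byte(rec.DockerID), data)
	})
}

func main() {
	// Timeout keeps Open from blocking forever if another process holds the file lock.
	db, err := bolt.Open("agent_state.db", 0600, &bolt.Options{Timeout: time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rec := containerRecord{DockerID: "61dfe4ce7a41", Name: "example-container"}
	if err := saveContainer(db, rec); err != nil {
		log.Printf("saving container state failed: %v", err)
	}
}
```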

I will mark this as a bug and we will let you know when the fix is released.

@ellenthsu

This fix has been released as part of ECS Agent 1.44.4: https://github.com/aws/amazon-ecs-agent/releases/tag/1.44.4

Please update the agent (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-update.html), or use the latest ECS Optimized AMIs containing ECS Agent 1.44.4 (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html).
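
For convenience, a minimal sketch of requesting the agent update through the ECS API with the AWS SDK for Go (v1); note that UpdateContainerAgent only works on container instances running the Amazon ECS-optimized AMI, and the cluster name and instance ARN below are placeholders:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
	svc := ecs.New(sess)

	// Ask ECS to update the agent in place on the given container instance.
	_, err := svc.UpdateContainerAgent(&ecs.UpdateContainerAgentInput{
		Cluster:           aws.String("my-cluster"),
		ContainerInstance: aws.String("arn:aws:ecs:eu-west-1:123456789012:container-instance/xxx"),
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("agent update requested")
}
```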
