State not saving correctly on agent termination, resulting in incorrect task states #2618

Closed
tom-pryor opened this issue Sep 8, 2020 · 2 comments

Comments

@tom-pryor

Summary

When the agent terminates, the final state isn't saved due to a timeout. After a restart, the state of the tasks is inconsistent with the state of the actual containers.

Description

We first noticed a problem when we had tasks marked as RUNNING in the ECS control plane, but on inspection the container corresponding to each task was stopped. This causes serious problems: whole services can fail because the ECS control plane reports tasks as RUNNING while no containers are actually running.

Upon termination the ecs-agent outputs a critical error message:

level=info time=2020-09-07T18:42:02Z msg="Agent received termination signal: terminated" module=termination_handler.go
level=critical time=2020-09-07T18:42:05Z msg="Error saving state before final shutdown: Multiple error:\n\t0: final save: timed out trying to save to disk" module=termination_handler.go

Upon restart it outputs the same warning for many tasks:

level=info time=2020-09-07T18:43:48Z msg="Restored from checkpoint file. I am running as 'arn:aws:ecs:eu-west-1:xxx:container-instance/xxx' in cluster 'xxx'" module=agent.go
level=info time=2020-09-07T18:43:48Z msg="Remaining mem: 7624" module=client.go
level=info time=2020-09-07T18:43:48Z msg="Registered container instance with cluster!" module=client.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/25b829d7-091d-480d-8fa4-34c990398464]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/45559759-07a9-4829-a3c0-a4da8450c31a]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/95c72f1c-5934-4c40-b93b-2e6fa143f028]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/cce29eba-9e2c-46ea-94c5-552372887bca]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/cd0981db-0bd0-49c9-b15a-9e5e24ae46f8]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/f9c59e59-dae5-458c-93ef-f360b6dbcb93]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/0b6b3570-cdcd-478a-83b6-daee928cb7fc]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/48804e72-1443-46f2-9141-3ef6eded84c3]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/5934c040-ea83-4755-91f0-9508d7267d54]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go

The ECS agent then logs the same messages, seemingly endlessly, about the tasks that are stuck in RUNNING:

Sept 08 10:21:38 2857d08d7e52 ecs-agent level=info time=2020-09-08T09:21:38Z msg="Managed task [arn:aws:ecs:eu-west-1:xxx:task/ce7355bb-0e8a-404e-8378-fdea7bf9001f]: Container [name=xxx runtimeID=61dfe4ce7a415e1c65bd6744fd998a351248d980d25554af8fcc46c3a8da2a4d]: handling container change event [NONE]" module=task_manager.go
Sept 08 10:21:38 2857d08d7e52 ecs-agent level=info time=2020-09-08T09:21:38Z msg="Managed task [arn:aws:ecs:eu-west-1:xxx:task/ce7355bb-0e8a-404e-8378-fdea7bf9001f]: Container [name=xxx runtimeID=61dfe4ce7a415e1c65bd6744fd998a351248d980d25554af8fcc46c3a8da2a4d]: container change RUNNING->NONE is redundant" module=task_manager.go

(N.B. in the example above, container ID 61dfe4ce7a415e1c65bd6744fd998a351248d980d25554af8fcc46c3a8da2a4d is stopped.)

The only solution is either to stop the tasks via the ECS API/Console or to terminate the EC2 instance running the impacted agent. Note that draining the instance before the agent is restarted would presumably also avoid the problem.
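
For reference, a minimal sketch of the stop-the-tasks workaround using the AWS SDK for Go (v1); the cluster name and container-instance ARN below are placeholders, and ListTasks pagination is omitted for brevity:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

func main() {
	// Placeholder values; substitute your own cluster and container-instance ARN.
	cluster := "my-cluster"
	instanceArn := "arn:aws:ecs:eu-west-1:123456789012:container-instance/xxx"

	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
	svc := ecs.New(sess)

	// List the tasks the control plane still believes are RUNNING on the affected instance.
	// (A single page is fetched here; handle NextToken for instances with many tasks.)
	out, err := svc.ListTasks(&ecs.ListTasksInput{
		Cluster:           aws.String(cluster),
		ContainerInstance: aws.String(instanceArn),
		DesiredStatus:     aws.String("RUNNING"),
	})
	if err != nil {
		log.Fatal(err)
	}

	// Stop each task so the scheduler replaces it with a healthy one.
	for _, taskArn := range out.TaskArns {
		_, err := svc.StopTask(&ecs.StopTaskInput{
			Cluster: aws.String(cluster),
			Task:    taskArn,
			Reason:  aws.String("container already stopped; agent state inconsistent"),
		})
		if err != nil {
			log.Printf("failed to stop %s: %v", aws.StringValue(taskArn), err)
			continue
		}
		log.Printf("stopped %s", aws.StringValue(taskArn))
	}
}
```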

Perhaps this is related to the new boltdb storage for state? To me this suggests other problems as well, beyond the restart case: what happens if the agent crashes, or there is some other unexpected restart of the EC2 instance or Docker? Will the state be inconsistent then?

Expected Behavior

The agent correctly saves the state before shutting down and correctly remembers the state after restarting. Tasks shouldn't be marked as RUNNING when their container isn't.

Observed Behavior

The agent doesn't correctly save its state on shutdown and restarts in an inconsistent state. Tasks are left in state RUNNING, but their containers aren't actually running.

Environment Details

Containers: 430
 Running: 18
 Paused: 0
 Stopped: 412
Images: 33
Server Version: 18.06.3-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: a592beb5bc4c4092b1b1bac971afed27687340c5
init version: fec3683b971d9c3ef73f284f176672c44b448662
Security Options:
 seccomp
  Profile: default
 selinux
Kernel Version: 4.19.143-flatcar
Operating System: Flatcar Container Linux by Kinvolk 2512.4.0 (Oklo)
OSType: linux
Architecture: x86_64
xxx
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
{"Cluster":"xxx","ContainerInstanceArn":"arn:aws:ecs:eu-west-1:xxx:container-instance/xxx","Version":"Amazon ECS Agent - v1.44.3 (33b647b8)"}

Supporting Log Snippets

See above for snippets. Happy to be contacted for the tarball from the log collector.

@fenxiong (Contributor) commented Sep 8, 2020

Hi,
Sorry to hear that you had this issue. Currently, due to a bug introduced when switching to boltDB, the agent relies on the termination signal handler to store the container ID and name, so if the agent fails to save the state at termination, it loses track of the containers. We have fixed the bug with #2608 and #2609, so that the agent saves the container ID and name when creating the container; a single save failure then affects only one container instead of all containers (which matches the behavior of previous agent versions that don't use boltDB). The fixes will go out in our next release. The workaround before the next release is to downgrade the agent to 1.43.0 or earlier.
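
For illustration only, a minimal sketch of the save-on-create idea described above, using the bbolt library; this is not the actual agent code, and the bucket, record, and function names are made up:

```go
package main

import (
	"encoding/json"
	"log"
	"time"

	bolt "go.etcd.io/bbolt"
)

// containerRecord is a made-up, simplified stand-in for the agent's
// per-container state; the real agent stores much richer task/container data.
type containerRecord struct {
	DockerID string `json:"dockerId"`
	Name     string `json:"name"`
}

// saveContainer persists one container's ID and name at creation time, so a
// failed save affects only that container rather than the whole checkpoint
// written by the termination handler.
func saveContainer(db *bolt.DB, rec containerRecord) error {
	data, err := json.Marshal(rec)
	if err != nil {
		return err
	}
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("containers"))
		if err != nil {
			return err
		}
		return b.Put([]byte(rec.DockerID), data)
	})
}

func main() {
	// Timeout keeps Open from blocking forever if another process holds the file lock.
	db, err := bolt.Open("agent_state.db", 0600, &bolt.Options{Timeout: time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	rec := containerRecord{DockerID: "61dfe4ce7a41", Name: "example-container"}
	if err := saveContainer(db, rec); err != nil {
		log.Printf("saving container state failed: %v", err)
	}
}
```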

I will mark this as a bug and we will let you know when the fix is released.

@ellenthsu

This fix has been released as part of ECS Agent 1.44.4: https://github.com/aws/amazon-ecs-agent/releases/tag/1.44.4

Please update the agent (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-agent-update.html), or use the latest ECS Optimized AMIs containing ECS Agent 1.44.4 (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/ecs-optimized_AMI.html).
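
For convenience, a minimal sketch of requesting the agent update through the ECS API with the AWS SDK for Go (v1); note that UpdateContainerAgent only works on container instances running the Amazon ECS-optimized AMI, and the cluster name and instance ARN below are placeholders:

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

func main() {
	sess := session.Must(session.NewSession(&aws.Config{Region: aws.String("eu-west-1")}))
	svc := ecs.New(sess)

	// Ask ECS to update the agent in place on the given container instance.
	_, err := svc.UpdateContainerAgent(&ecs.UpdateContainerAgentInput{
		Cluster:           aws.String("my-cluster"),
		ContainerInstance: aws.String("arn:aws:ecs:eu-west-1:123456789012:container-instance/xxx"),
	})
	if err != nil {
		log.Fatal(err)
	}
	log.Println("agent update requested")
}
```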
