Summary
When the agent terminates, the final state isn't saved due to a timeout. Upon restarting, the state of the tasks and the state of the actual containers are inconsistent.
Description
We first noticed a problem when we had tasks marked as RUNNING via the ECS control plane, but upon inspection the containers corresponding to those tasks were stopped. This causes serious problems: whole services can end up failing because, while the ECS control plane reports the tasks as RUNNING, no containers are actually running.
Upon termination the ecs-agent outputs a critical error message:
level=info time=2020-09-07T18:42:02Z msg="Agent received termination signal: terminated" module=termination_handler.go
level=critical time=2020-09-07T18:42:05Z msg="Error saving state before final shutdown: Multiple error:\n\t0: final save: timed out trying to save to disk" module=termination_handler.go
Upon restart, it outputs the same warning for many tasks:
level=info time=2020-09-07T18:43:48Z msg="Restored from checkpoint file. I am running as 'arn:aws:ecs:eu-west-1:xxx:container-instance/xxx' in cluster 'xxx'" module=agent.go
level=info time=2020-09-07T18:43:48Z msg="Remaining mem: 7624" module=client.go
level=info time=2020-09-07T18:43:48Z msg="Registered container instance with cluster!" module=client.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/25b829d7-091d-480d-8fa4-34c990398464]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/45559759-07a9-4829-a3c0-a4da8450c31a]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/95c72f1c-5934-4c40-b93b-2e6fa143f028]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/cce29eba-9e2c-46ea-94c5-552372887bca]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/cd0981db-0bd0-49c9-b15a-9e5e24ae46f8]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/f9c59e59-dae5-458c-93ef-f360b6dbcb93]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/0b6b3570-cdcd-478a-83b6-daee928cb7fc]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/48804e72-1443-46f2-9141-3ef6eded84c3]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
level=warn time=2020-09-07T18:43:48Z msg="Task engine [arn:aws:ecs:eu-west-1:xxx:task/5934c040-ea83-4755-91f0-9508d7267d54]: could not find matching container for expected name []: Error: No such container: " module=docker_task_engine.go
Then the ECS agent starts logging the same messages, seemingly endlessly, about the tasks that are stuck in RUNNING:
(N.B. in the example above, container ID 61dfe4ce7a415e1c65bd6744fd998a351248d980d25554af8fcc46c3a8da2a4d is stopped.)
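For anyone who wants to confirm the mismatch on the host, here is a minimal sketch (our own check, not anything the agent provides) using the Docker Go client; the container ID is the stopped one from the example above, so swap in your own:

```go
// Confirm that a container ECS still reports as RUNNING is actually stopped
// (or gone) on the host, using the Docker Go client.
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Container ID taken from the log example above; substitute your own.
	const id = "61dfe4ce7a415e1c65bd6744fd998a351248d980d25554af8fcc46c3a8da2a4d"
	info, err := cli.ContainerInspect(context.Background(), id)
	if err != nil {
		// "No such container" here while ECS still shows the task as RUNNING
		// is exactly the inconsistency described above.
		log.Fatalf("inspect failed: %v", err)
	}
	fmt.Printf("container %s state: %s (running=%v)\n", id, info.State.Status, info.State.Running)
}
```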
The solution is either to stop the tasks via the ECS API/console or to terminate the EC2 instance running the impacted agent. Note that draining the instance before the agent is restarted would presumably also avoid the problem.
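As a rough sketch of that workaround with the AWS SDK for Go (our choice of tooling; the console or CLI works the same way), drain the instance and then stop the stuck tasks. The cluster name and container-instance ARN below are placeholders:

```go
// Drain the affected container instance and stop the tasks that ECS still
// believes are RUNNING on it. Cluster name and instance ARN are placeholders.
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/ecs"
)

func main() {
	svc := ecs.New(session.Must(session.NewSession()))
	cluster := aws.String("my-cluster")
	instanceARN := aws.String("arn:aws:ecs:eu-west-1:xxx:container-instance/xxx")

	// Drain the instance so the scheduler replaces its tasks elsewhere.
	if _, err := svc.UpdateContainerInstancesState(&ecs.UpdateContainerInstancesStateInput{
		Cluster:            cluster,
		ContainerInstances: []*string{instanceARN},
		Status:             aws.String("DRAINING"),
	}); err != nil {
		log.Fatal(err)
	}

	// Stop any tasks ECS still reports as RUNNING on that instance.
	tasks, err := svc.ListTasks(&ecs.ListTasksInput{
		Cluster:           cluster,
		ContainerInstance: instanceARN,
		DesiredStatus:     aws.String("RUNNING"),
	})
	if err != nil {
		log.Fatal(err)
	}
	for _, arn := range tasks.TaskArns {
		if _, err := svc.StopTask(&ecs.StopTaskInput{
			Cluster: cluster,
			Task:    arn,
			Reason:  aws.String("agent lost track of container after restart"),
		}); err != nil {
			log.Printf("failed to stop %s: %v", *arn, err)
		}
	}
}
```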
Perhaps this is related to the new BoltDB storage for state? To me this suggests other problems as well, beyond a clean agent restart: what happens if the agent crashes, or if there is some other unexpected restart of the EC2 instance or Docker? Will the state be inconsistent then?
Expected Behavior
The agent correctly saves its state before shutting down and correctly restores it after restarting. Tasks shouldn't be marked as RUNNING when their containers aren't running.
Observed Behavior
The agent doesn't correctly save its state when shutting down and restarts in an inconsistent state. Tasks are left in state RUNNING even though their containers aren't actually running.
Environment Details
Supporting Log Snippets
See above for snippets. Happy to be contacted for the tarball from the log collector.
Hi,
Sorry to hear that you had this issue. Currently, due to a bug introduced when switching to BoltDB, the agent relies on the termination signal handler to store the container IDs and names, so if the agent fails to save its state at termination, it loses track of the containers. We have fixed the bug with #2608 and #2609: the agent will now save the container ID and name when it creates the container, so a single failed save affects only one container instead of all containers (which matches the behavior of previous agent versions that don't use BoltDB). The fixes will go out in our next release. The current workaround before then is to downgrade the agent to 1.43.0 or earlier.
I will mark this as a bug and we will let you know when the fix is released.
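For anyone curious about the difference in behavior, here is an illustrative sketch only (not the agent's actual code) of the idea described above: persisting each container's ID and name to BoltDB at creation time rather than only in the termination handler, so a single failed save affects at most one container. The DB path, bucket name, and record type here are invented for the example:

```go
// Illustrative only: write a container record to BoltDB as soon as the
// container is created, instead of relying on one big save at shutdown.
package main

import (
	"encoding/json"
	"log"
	"time"

	bolt "go.etcd.io/bbolt"
)

// Hypothetical record type for the example.
type containerRecord struct {
	DockerID string `json:"dockerId"`
	Name     string `json:"name"`
	TaskARN  string `json:"taskArn"`
}

func saveContainer(db *bolt.DB, rec containerRecord) error {
	return db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("containers"))
		if err != nil {
			return err
		}
		data, err := json.Marshal(rec)
		if err != nil {
			return err
		}
		// Keyed by task ARN + container name so a restart can map Docker IDs
		// back to the tasks the agent restored from its checkpoint.
		return b.Put([]byte(rec.TaskARN+"/"+rec.Name), data)
	})
}

func main() {
	db, err := bolt.Open("/var/lib/ecs/data/example.db", 0600, &bolt.Options{Timeout: 5 * time.Second})
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Called right after the container is created, not only in the
	// termination handler.
	err = saveContainer(db, containerRecord{
		DockerID: "example-docker-container-id",
		Name:     "app",
		TaskARN:  "arn:aws:ecs:eu-west-1:xxx:task/25b829d7-091d-480d-8fa4-34c990398464",
	})
	if err != nil {
		log.Fatal(err)
	}
}
```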