dockerstate: map keys differ before create #1033

Merged
merged 5 commits into from
Oct 25, 2017

@samuelkarp samuelkarp commented Oct 25, 2017

Summary

Fixes #1024

If a task is run where the image fails to pull, the agent will get into an unrecoverable state after it cleans up the task. This happens because it fails to delete an entry in the state.idToTask and state.idToContainer maps as it attempts to delete with the wrong key.

Implementation details

The state.idToTask and state.idToContainer maps initially have the keys set to the generated DockerName rather than the DockerID. Once containers are created, the keys are changed to the DockerID. This changed as part of this commit: 49c36c7#diff-464fc7e15d1a6b818ceca89e7b68cd4e.
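
The failure mode described above can be reproduced in a minimal sketch. The `container` and `state` types and the `add`, `removeByID`, and `removeByKey` methods below are hypothetical, simplified stand-ins and not the agent's actual code; only the `DockerID`/`DockerName` fields and the `idToContainer` map name come from this PR.

```go
package main

import "fmt"

// container is a hypothetical, simplified stand-in for the agent's container record.
type container struct {
	DockerID   string // empty until Docker has created the container
	DockerName string // agent-generated name, known before creation
}

// state is a hypothetical, pared-down version of the task engine state.
type state struct {
	idToContainer map[string]*container
}

// add stores the container under its DockerID, or under its DockerName
// when no DockerID exists yet.
func (s *state) add(c *container) {
	key := c.DockerID
	if key == "" {
		key = c.DockerName
	}
	s.idToContainer[key] = c
}

// removeByID mirrors the buggy behavior: it always deletes by DockerID,
// which is a no-op for a container that was never created.
func (s *state) removeByID(c *container) {
	delete(s.idToContainer, c.DockerID)
}

// removeByKey falls back to the DockerName when the DockerID is empty.
func (s *state) removeByKey(c *container) {
	key := c.DockerID
	if key == "" {
		key = c.DockerName
	}
	delete(s.idToContainer, key)
}

func main() {
	s := &state{idToContainer: map[string]*container{}}
	// The image pull failed, so the container never received a DockerID.
	c := &container{DockerName: "ecs-example-1-web"}
	s.add(c)

	s.removeByID(c)
	fmt.Println("entries after delete by DockerID:", len(s.idToContainer))

	s.removeByKey(c)
	fmt.Println("entries after delete by fallback key:", len(s.idToContainer))
}
```

In this sketch the first delete leaves the stale entry behind, which corresponds to the leftover state that later prevents the agent from restoring.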

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -o amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass

New tests cover the changes: yes

Description for the changelog

Bug - Fixed a bug where tasks that fail to pull containers can cause the agent to fail to restore properly after a restart. #1024

Licensing

This contribution is under the terms of the Apache 2.0 License: yes

@petderek petderek left a comment

LGTM, but "RemoveTask" should have a unit test that validates this change.

@samuelkarp

LGTM, but "RemoveTask" should have a unit test that validates this change.

Yep, that's why this is marked WIP right now. Tests are forthcoming (probably tomorrow).

@aaithal aaithal left a comment

LGTM, will approve it when it's not WIP anymore.

@vsiddharth vsiddharth left a comment

LGTM 🚀

Please add tests before merging!

@samuelkarp samuelkarp changed the title from "[WIP] dockerstate: map keys differ before create" to "dockerstate: map keys differ before create" Oct 25, 2017

samuelkarp commented Oct 25, 2017

I've modified some unit tests and also added a new functional test to ensure we don't regress behavior here. This test fails with the v1.14.5 agent.

Passing:

=== RUN   TestSavedStateWithInvalidImageAndCleanup
Container InstanceArn: arn:aws:ecs:us-west-2:123456789012:container-instance/50ca3395-2993-4d8f-b362-f2cf0a12fc28
Sleeping...
Container InstanceArn: arn:aws:ecs:us-west-2:123456789012:container-instance/50ca3395-2993-4d8f-b362-f2cf0a12fc28
--- PASS: TestSavedStateWithInvalidImageAndCleanup (143.41s)
	utils_unix.go:119: Created directory /tmp/ecs_integ_testdata958754272 to store test data in
	utils_unix.go:133: Launching agent with image: amazon/amazon-ecs-agent:make
	utils_unix.go:203: Agent started as docker container: 11b79d85286f3f524c4bafbc69bea2e3d3b06ea1bd69b3bf28ab5cd2aea8eaaa
	utils.go:149: Found agent metadata: {Cluster:ftest ContainerInstanceArn:0xc4204126b0 Version:Amazon ECS Agent - v1.14.5 (*01d1bef)}
	utils.go:170: Task definition: ecsinteg-invalid-image-994b6e90c23db8b0454da881022416d2:1
	utils.go:190: Started task: arn:aws:ecs:us-west-2:123456789012:task/1d5de746-c8b8-4f65-a77c-0016cb342595
	utils_unix.go:133: Launching agent with image: amazon/amazon-ecs-agent:make
	utils_unix.go:203: Agent started as docker container: 151f2fd3cb64e0c851eb49abec170a0362a8cb27e044a09411753d8006acb1ad
	utils.go:149: Found agent metadata: {Cluster:ftest ContainerInstanceArn:0xc4203fa2f0 Version:Amazon ECS Agent - v1.14.5 (*01d1bef)}
	utils.go:159: Removing test dir for passed test /tmp/ecs_integ_testdata958754272
PASS
ok  	github.com/aws/amazon-ecs-agent/agent/functional_tests/tests	143.690s

Failing:

=== RUN   TestSavedStateWithInvalidImageAndCleanup
Container InstanceArn: arn:aws:ecs:us-west-2:123456789012:container-instance/6158ba9a-785c-4eff-b6f9-22a701ed0ec1
Sleeping...
--- FAIL: TestSavedStateWithInvalidImageAndCleanup (152.49s)
	utils_unix.go:122: Created directory /tmp/ecs_integ_testdata310137584 to store test data in
	utils_unix.go:136: Launching agent with image: amazon/amazon-ecs-agent:v1.14.5
	utils_unix.go:206: Agent started as docker container: aee2037ab407a33c252800558ab21bdff980ac6c9925eb1d557219934379bed6
	utils.go:149: Found agent metadata: {Cluster:ftest ContainerInstanceArn:0xc420330240 Version:Amazon ECS Agent - v1.14.5 (0dcd02c)}
	utils.go:170: Task definition: ecsinteg-invalid-image-994b6e90c23db8b0454da881022416d2:1
	utils.go:190: Started task: arn:aws:ecs:us-west-2:123456789012:task/d69e1bea-ae1f-4a30-928b-59db19fb69b4
	utils_unix.go:136: Launching agent with image: amazon/amazon-ecs-agent:v1.14.5
	utils_unix.go:206: Agent started as docker container: 0aea2b6487b3f1043c11cd4463dd1daf462f1906429122489c7cf776d985c7e8
        Error Trace:    functionaltests_test.go:123
	Error:		Received unexpected error "Could not get agent metadata after launching it"
	Messages:	failed to start agent again
		
	utils.go:157: Preserving test dir for failed test /tmp/ecs_integ_testdata310137584
FAIL
exit status 1
FAIL	github.com/aws/amazon-ecs-agent/agent/functional_tests/tests	152.762s

assert.Equal(t, *testTask.TaskArn, resp.Arn, "arn should be equal")

// wait two minutes for it to be cleaned up
fmt.Println("Sleeping...")
Contributor

minor: remove this? Or use a logger?

Contributor Author

fmt.Println will print out immediately (when running tests for a single package) while t.Log buffers until the test ends. I had wanted to print this out during the run so I could inspect the agent logs while it was sleeping. I'm inclined to leave this in right now, but I can remove it if you feel strongly.

Contributor

Nah, you can leave this be. These are functional tests, so it doesn't matter if we log to stdout immediately here.

delete(state.idToContainer, dockerContainer.DockerID)
// The key to these maps is either the Docker ID or agent-generated name. We use the agent-generated name
// before a Docker ID is available.
key := dockerContainer.DockerID
Contributor

It might be safer to move this out of this method into a getKey(container) method, so that future modifications to this method do not touch that critical piece of business logic. What do you think?

Contributor Author

The code that adds to the map also removes references during a subsequent call. I'll extract both operations out to explicit functions with comments explaining why they're doing this.
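
One way the extraction discussed in this thread could look is sketched below. This is not the code that was merged: the `engineState` and `dockerContainer` types and the `mapKey`, `storeMappings`, and `removeMappings` names are hypothetical, and the task value is reduced to a string for brevity. The sketch assumes the store operation is called once before the container is created and again once a `DockerID` is known.

```go
package main

import "fmt"

// dockerContainer is a hypothetical, pared-down container record.
type dockerContainer struct {
	DockerID   string
	DockerName string
}

// engineState is a hypothetical stand-in for the dockerstate type.
type engineState struct {
	idToContainer map[string]*dockerContainer
	idToTask      map[string]string // value is a task ARN in this sketch
}

// mapKey returns the DockerID when one is available and the
// agent-generated DockerName otherwise.
func mapKey(c *dockerContainer) string {
	if c.DockerID != "" {
		return c.DockerID
	}
	return c.DockerName
}

// storeMappings records the container and its task under mapKey(c). Once a
// DockerID exists, it also drops entries stored earlier under the DockerName.
func (s *engineState) storeMappings(c *dockerContainer, taskArn string) {
	if c.DockerID != "" {
		delete(s.idToContainer, c.DockerName)
		delete(s.idToTask, c.DockerName)
	}
	key := mapKey(c)
	s.idToContainer[key] = c
	s.idToTask[key] = taskArn
}

// removeMappings deletes with the same key that storeMappings used.
func (s *engineState) removeMappings(c *dockerContainer) {
	key := mapKey(c)
	delete(s.idToContainer, key)
	delete(s.idToTask, key)
}

func main() {
	s := &engineState{
		idToContainer: map[string]*dockerContainer{},
		idToTask:      map[string]string{},
	}
	c := &dockerContainer{DockerName: "ecs-example-1-web"}

	s.storeMappings(c, "task-arn-placeholder")
	_, byName := s.idToTask[c.DockerName]
	fmt.Println("keyed by name before create:", byName)

	c.DockerID = "abc123"
	s.storeMappings(c, "task-arn-placeholder")
	_, byName = s.idToTask[c.DockerName]
	_, byID := s.idToTask[c.DockerID]
	fmt.Println("keyed by name after create:", byName, "keyed by ID:", byID)

	s.removeMappings(c)
	fmt.Println("entries left:", len(s.idToContainer)+len(s.idToTask))
}
```

Keeping the key selection in one function means the store and remove paths cannot drift apart, which is the concern raised in the review comment.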

@richardpen richardpen left a comment

LGTM, minor comments

CHANGELOG.md Outdated
@@ -6,6 +6,9 @@
[#1014](https://github.com/aws/amazon-ecs-agent/pull/1014)
* Enhancement - Support `init` process in containers by adding support for Docker remote API client version 1.25
[#996](https://github.com/aws/amazon-ecs-agent/pull/996)
* Bug - Fixed a bug where tasks that fail to pull containers can cause the
agent to fail to restore properly after a restart.
[#1024](https://github.com/aws/amazon-ecs-agent/issues/1024)

Shouldn't this link to the PR?

Contributor Author

I wrote this before the PR was created. I can update it.

@@ -166,8 +166,12 @@ func (agent *TestAgent) StartAgent() error {
Links: agent.Options.ContainerLinks,
}

if os.Getenv("ECS_FTEST_FORCE_NET_HOST") != "" {

Where is this environment variable being set?

Contributor Author

It's not automatically set. I added this because I was developing the test in an environment where the docker0 bridge did not have Internet access (for unrelated reasons) and wanted the agent to actually be able to run.

Then can you add this to agent/functional_tests/README.md?

Contributor Author

Done. I rewrote a bit more of the README to make it clearer.

This test verifies that the agent can be restarted after a task with an
invalid image is cleaned up.  This test fails with v1.14.5 and passes
after this commit.
@samuelkarp

@aaithal @richardpen Can you take a look?

@aaithal aaithal left a comment

⛵

@samuelkarp samuelkarp merged commit 81b8672 into aws:dev Oct 25, 2017
@aaithal aaithal added this to the 1.15.0 milestone Oct 30, 2017
@jhaynes jhaynes mentioned this pull request Oct 30, 2017