Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix issue1043 #1217

Merged
merged 6 commits into from
Jan 26, 2018
Merged

fix issue1043 #1217

merged 6 commits into from
Jan 26, 2018

Conversation

aaithal
Copy link
Contributor

@aaithal aaithal commented Jan 25, 2018

Summary

Implementation details

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -out amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass
  • Manual tests:
    • Simulated inspect error in start container and verified that other containers and the started container in a task with ephemeral volumes were stopped and cleaned up properly
    • Simulated docker timeout error in start container and verified that other containers and the started container in a task with ephemeral volumes were stopped and cleaned up properly
      New tests cover the changes: yes!

Description for the changelog

Licensing

This contribution is under the terms of the Apache 2.0 License:

aaithal added a commit to aaithal/amazon-ecs-agent that referenced this pull request Jan 25, 2018
// a stop
seelog.Warnf("Managed task [%s]: Forcing container [%s] to stop",
mtask.Arn, container.Name)
go mtask.engine.transitionContainer(mtask.Task, container, api.ContainerStopped)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to check the KnownStatus of the container here? In case it has been stopped or already running, eg:

StartContainer -> inspect timeout -> stop the container 
StartEvent -> inspect timeout -> stop the container

@aaithal aaithal changed the title [wip] fix issue1043 fix issue1043 Jan 26, 2018
@aaithal aaithal added this to the 1.17.0 milestone Jan 26, 2018
@@ -267,6 +274,13 @@ func (mtask *managedTask) waitSteady() {
}
}

// steadyState returns if the task is in a steady state. Steady state is when task's desired
// and known status are both RUNNING
func (mtask *managedTask) steadyState() bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, and possibly my java-brain speak here: can this be named "isSteadyState" or "isInSteadyState"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, don't have strong opinions here. if task.steadyState() conveys the 'is task in steady state?' meaning in my brain. Golang prefers less verbosity in names (though we don't always stick to that convention). If you feel strongly about it, I can change it.

Fixed an issue where ECS agent could lose track of running containers
if there were timeouts strating the container or inspecting the
started container.
This should hopefully fix failure seen in
https://ci.appveyor.com/project/AmazonECS/amazon-ecs-agent/build/1620

This change ensures that the message is sent on the stopDone channel
before marking the waitgroup as done
Copy link

@richardpen richardpen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one minor comment, also please make sure the header was updated to 2018 for the modified file.

CHANGELOG.md Outdated
@@ -5,6 +5,8 @@
* Feature - Support Docker health check.
* Bug - Fixed a bug where `-version` fails due to its dependency on docker client. [#1118](https://github.com/aws/amazon-ecs-agent/pull/1118)
* Bug - Persist container exit code in agent state file. [#1125](https://github.com/aws/amazon-ecs-agent/pull/1125)
* Bug - Persist container exit code in agent state file [#1125](https://github.com/aws/amazon-ecs-agent/pull/1125)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should be consistent in the changelog, whether the . is needed at the end.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think convention has been to not add .'s at the end. I'll modify that.

@@ -518,6 +518,7 @@ func (dg *dockerGoClient) createContainer(ctx context.Context,
if err != nil {
return DockerContainerMetadata{Error: CannotCreateContainerError{err}}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: blank line

switch event.Status {
case "create":
status = api.ContainerCreated
// TODO no need to inspect containers here
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

im probably missing context, but what do you mean by "no need to inspect containers here"?

Copy link
Contributor Author

@aaithal aaithal Jan 26, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to inspect containers after they are created once we adopt Docker's volume APIs. Today, that's the only information we need from the inspect API. Once we start injecting that ourselves, there's no need to inspect containers on Create anymore. This will save us a lot of inspect calls in the future. I can update the comment to have this context

default:
// Because docker emits new events even when you use an old event api
// version, it's not that big a deal
seelog.Debugf("DockerGoClient: unknown status event from docker: %s", event.Status)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should we a tie an event.ID to this debug line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably. I'll add more details here.

return
}
if containerKnownStatus != api.ContainerStopped {
// Container's known status is STOPPED. Nothing to do here.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did you mean to put container's known status is NOT stopped here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. thanks for catching that.

taskKnownStatus := mtask.GetKnownStatus()
return taskKnownStatus == api.TaskRunning && taskKnownStatus >= mtask.GetDesiredStatus()
// handleContainerStoppedTransitionError handles an error when transitioning a container to
// STOPPED. It returns a boolean indicating whethere the taks can continue with updating its
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: whether, tasks

@aaithal aaithal merged commit 9f40dcf into aws:dev Jan 26, 2018
aaithal added a commit that referenced this pull request Jan 26, 2018
for !mtask.waitEvent(othersStopped) {
if mtask.GetDesiredStatus().Terminal() {
// If we end up here, that means we received a start then stop for this
// task before a task that was expected to stop before it could
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this comment is not readable. Can you please refactor?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:) it's a right of initiation for anyone who's relatively new to this code base to read this and wonder what in the world is the meaning of this sentence. Since I've already merged this PR, we can pick this next time we touch this file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this will help anyone who stumbles here: We want to process stop events (a task changes to desired=STOPPED) prior to start events (desired=RUNNING) so that any resources that are consumed by the desired=STOPPED task can be freed for the desired=RUNNING task. This logic helps to enforce the ordering.

@@ -418,7 +445,7 @@ func (mtask *managedTask) emitTaskEvent(task *api.Task, reason string) {
func (mtask *managedTask) emitContainerEvent(task *api.Task, cont *api.Container, reason string) {
event, err := api.NewContainerStateChangeEvent(task, cont, reason)
if err != nil {
seelog.Infof("Managed task [%s]: unable to create state change event for container [%s]: %v",
seelog.Debugf("Managed task [%s]: unable to create state change event for container [%s]: %v",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not Infof ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

valid internal state change change events like NONE and CREATED end up resulting in an error here. Hence, reverted back to Debug

container.SetKnownStatus(currentKnownStatus)
container.SetDesiredStatus(api.ContainerStopped)
errorName := event.Error.ErrorName()
if errorName == dockerTimeoutErrorName || errorName == cannotInspectContainerErrorName {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there other inspect error types? Is the error name check necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can either inspect types or use error names. Error names are easier as there's no type-casting involved. Instead of casting a wide net here, I'd prefer to selectively add error types as that impact to limit any potential unwanted impact of this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants