
Fix container metadata information missing during agent restart #1183

Merged
3 commits merged into aws:dev on Jan 31, 2018

Conversation


@richardpen richardpen commented Jan 15, 2018

Summary

Fix the issue where the agent could miss container metadata (port mappings/exit code) if a container is started or stopped while the agent isn't running (e.g., across a restart).

Implementation details

Record the container metadata during state reconciliation on agent restart, as sketched below.
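
A minimal, self-contained sketch of that idea in Go. The struct fields and helper below are simplified stand-ins for illustration, not the agent's real api.Container / DockerContainerMetadata types:

package main

import (
	"fmt"
	"time"
)

// PortBinding, DockerContainerMetadata, and Container are simplified
// stand-ins for the agent's types.
type PortBinding struct {
	ContainerPort uint16
	HostPort      uint16
}

type DockerContainerMetadata struct {
	ExitCode     *int
	PortBindings []PortBinding
	FinishedAt   time.Time
}

type Container struct {
	Name              string
	KnownExitCode     *int
	KnownPortBindings []PortBinding
	FinishedAt        time.Time
}

// updateContainerMetadata copies what Docker reported (exit code, port
// bindings, finish time) onto the agent's container record. On restart,
// the reconciliation loop calls this for every known container so that
// events that happened while the agent was down are not lost.
func updateContainerMetadata(metadata *DockerContainerMetadata, container *Container) {
	if metadata.ExitCode != nil {
		container.KnownExitCode = metadata.ExitCode
	}
	if len(metadata.PortBindings) != 0 {
		container.KnownPortBindings = metadata.PortBindings
	}
	if !metadata.FinishedAt.IsZero() {
		container.FinishedAt = metadata.FinishedAt
	}
}

func main() {
	exitCode := 0
	c := &Container{Name: "web"}
	m := &DockerContainerMetadata{
		ExitCode:     &exitCode,
		PortBindings: []PortBinding{{ContainerPort: 80, HostPort: 8080}},
		FinishedAt:   time.Now(),
	}
	updateContainerMetadata(m, c)
	fmt.Printf("%s: exit=%d ports=%v\n", c.Name, *c.KnownExitCode, c.KnownPortBindings)
}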

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -o amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass
  • Manual tests:
    • The issue is reproducible with the previous agent, but not with this fix.
    • Container information (port mapping/exit code) is correctly submitted to the backend.
    • Container information is updated correctly after an agent restart when containers are created/started/stopped while the agent isn't running.

New tests cover the changes:
yes

Description for the changelog

Licensing

This contribution is under the terms of the Apache 2.0 License:
yes

@richardpen richardpen force-pushed the fix-restart branch 2 times, most recently from caf8265 to 20d7ce9 Compare January 22, 2018 22:37
@richardpen richardpen changed the title [WIP] Fix container metadata information missing during agent restart Fix container metadata information missing during agent restart Jan 22, 2018
@richardpen richardpen force-pushed the fix-restart branch 3 times, most recently from efa87d9 to 423d1ae Compare January 22, 2018 23:13
@aaithal (Contributor) commented Jan 23, 2018

@richardpen can you add more details about which manual tests you're planning to run, and whether they have passed or failed?

if !container.Container.KnownTerminal() {
container.Container.ApplyingError = api.NewNamedError(&ContainerVanishedError{})
// If this is a Docker API error
if metadata.Error.ErrorName() == "CannotDescribeContainerError" {
Contributor:
minor: Can you make "CannotDescribeContainerError" a const?
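
A sketch of what that suggestion could look like; the const and helper names are illustrative, not from the PR:

package engine

// cannotDescribeContainerError names the Docker API error being compared
// against above; a named const avoids repeating the string literal.
const cannotDescribeContainerError = "CannotDescribeContainerError"

// isContainerVanished reports whether an inspect error name means the
// container is gone (illustrative helper, not part of the PR).
func isContainerVanished(errorName string) bool {
	return errorName == cannotDescribeContainerError
}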

now := engine.time().Now()
ok := task.SetExecutionStoppedAt(now)
if ok {
seelog.Infof("Recording execution stopped time for a task, essential container in task stopped, task %s, time: %s",
Contributor:
Once #1208 is merged, please rebase this on top of it, and don't forget to update the log string to match the format followed there (see the sketch below):

  • "Recording execution → "Task engine [%s] Recording execution
  • use task.Arn instead of task.String()
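
A hedged sketch of the requested format; the helper function is illustrative, and github.com/cihub/seelog is the logging library the snippets above already use:

package engine

import (
	"time"

	"github.com/cihub/seelog"
)

// logExecutionStopped shows the requested shape: prefix the message with
// "Task engine [%s]" and identify the task by its ARN rather than
// task.String().
func logExecutionStopped(taskArn, containerName string, stoppedAt time.Time) {
	seelog.Infof("Task engine [%s]: recording execution stopped time, essential container [%s] stopped at: %s",
		taskArn, containerName, stoppedAt.String())
}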

@richardpen richardpen force-pushed the fix-restart branch 2 times, most recently from cf2c507 to 3181f40 Compare January 23, 2018 23:30
@@ -332,6 +365,15 @@ func (engine *DockerTaskEngine) synchronizeContainerStatus(container *api.Docker
// update the container known status
container.Container.SetKnownStatus(currentState)
}
// Update task ExecutionStoppedAt timestamp
Contributor:
Please rebase this from dev once #1217 is merged; this logic has now been moved into its own method. Maybe it's better to review this after that PR is merged.

@aaithal aaithal added this to the 1.17.0 milestone Jan 26, 2018

// Set Exitcode if it's not set
if metadata.ExitCode != nil && (container.GetKnownExitCode() == nil ||
aws.IntValue(metadata.ExitCode) != aws.IntValue(container.GetKnownExitCode())) {
Contributor:
Do we need the condition checks (container.GetKnownExitCode() == nil || aws.IntValue(metadata.ExitCode) != aws.IntValue(container.GetKnownExitCode()))? Can't we just set it unconditionally?

Author (@richardpen):
It should have the same effect, but I'm not sure which way is better. I can change it if no one objects.

Contributor:
I vote for setting it anyway and getting rid of this check. Nice catch!
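
Concretely, the simplification the thread converges on looks like this; SetKnownExitCode is assumed to be the setter paired with the GetKnownExitCode getter shown above:

// Before: only set the exit code when it differs from what is known.
if metadata.ExitCode != nil && (container.GetKnownExitCode() == nil ||
	aws.IntValue(metadata.ExitCode) != aws.IntValue(container.GetKnownExitCode())) {
	container.SetKnownExitCode(metadata.ExitCode)
}

// After: set it unconditionally; writing an identical value is harmless.
if metadata.ExitCode != nil {
	container.SetKnownExitCode(metadata.ExitCode)
}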

@@ -1103,3 +1103,21 @@ func (task *Task) GetID() (string, error) {

return resourceSplit[1], nil
}

func (task *Task) RecordExecutionStoppedAt(container *Container) {
Contributor:
lint: // RecordExecutionStoppedAt ...
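
The lint fix is just a doc comment that starts with the function name; the wording here is illustrative:

// RecordExecutionStoppedAt records the task's execution stopped timestamp
// when an essential container in the task has stopped.
func (task *Task) RecordExecutionStoppedAt(container *Container) {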

@@ -1720,3 +1720,170 @@ func TestHandleDockerHealthEvent(t *testing.T) {
})
assert.Equal(t, testContainer.Health.Status, api.ContainerHealthy)
}

func TestCreatedContainerMetadataUpdateOnRestart(t *testing.T) {
Contributor:
I think you can combine all of these into a single test: either a table-based test, or one task containing all of these containers.

@adnxn adnxn (Contributor) left a comment
code lgtm, minor comments about documentation. also i agree with @aaithal wrt merging those four *MetadataUpdateOnRestart tests into a single test with tables (see the sketch below).
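
A sketch of that table-driven consolidation; the test name, case names, statuses, and assertions are illustrative, not the PR's actual tests:

func TestContainerMetadataUpdateOnRestart(t *testing.T) {
	testCases := []struct {
		name   string
		status api.ContainerStatus
	}{
		{name: "created container", status: api.ContainerCreated},
		{name: "running container", status: api.ContainerRunning},
		{name: "stopped container", status: api.ContainerStopped},
		{name: "pulled container", status: api.ContainerPulled},
	}
	for _, tc := range testCases {
		t.Run(tc.name, func(t *testing.T) {
			// Set up a task whose container reconciles to tc.status, run
			// synchronizeContainerStatus, then assert that the metadata
			// (port bindings / exit code) was recorded for that status.
		})
	}
}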

@@ -290,6 +291,36 @@ func (engine *DockerTaskEngine) synchronizeState() {
engine.saver.Save()
}

func updateContainerMetadata(metadata *DockerContainerMetadata, container *api.Container, task *api.Task) {
Contributor:
nit: // updateContainerMetadata ...

@@ -301,51 +332,49 @@ func (engine *DockerTaskEngine) synchronizeContainerStatus(container *api.Docker
seelog.Warnf("Task engine [%s]: could not find matching container for expected name [%s]: %v",
task.Arn, container.DockerName, err)
} else {
// update the container metadata in case the container status/metadata changed during agent restart
Contributor:
so if i understand correctly, this codepath is for containers that were potentially created while we were down (debug msg above). is that right? maybe update the comment to reflect that.

@richardpen richardpen merged commit 233e382 into aws:dev Jan 31, 2018
@richardpen richardpen deleted the fix-restart branch March 2, 2018 23:51