
Fix concurrent writes during image cleanup #743

Closed

Conversation

Contributor

@vsiddharth commented Mar 28, 2017

This patch fixes the concurrent map write issue reported in the image cleanup path. A unit test has been added to increase confidence in the fix.

Fixes #707

Signed-off-by: Vinothkumar Siddharth sidvin@amazon.com

Summary

Implementation details
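For context, a minimal, self-contained sketch of the locking pattern this change adds. It is not the agent's real code: the field and method names mirror the diff hunks below, while the types and the cleanup loop are simplified stand-ins.

package engine

import "sync"

// Sketch only: a stripped-down image manager with just enough state to show
// the locking pattern. The real dockerImageManager has many more fields, and
// updateLock is assumed here to be a plain sync.Mutex.
type dockerImageManager struct {
    updateLock  sync.Mutex
    imageStates map[string]int // hypothetical stand-in for the shared image-state bookkeeping
}

// removeUnusedImages serializes cleanup behind updateLock so that deleting
// entries from the shared bookkeeping cannot race with writers on other
// goroutines, which is what produced the "concurrent map writes" crash in #707.
func (imageManager *dockerImageManager) removeUnusedImages() {
    imageManager.updateLock.Lock()
    defer imageManager.updateLock.Unlock()

    for imageID, references := range imageManager.imageStates {
        if references == 0 {
            delete(imageManager.imageStates, imageID)
        }
    }
}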

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -o amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass

New tests cover the changes:

Description for the changelog

Licensing

This contribution is under the terms of the Apache 2.0 License:

@@ -951,3 +953,68 @@ func TestGetImageStateFromImageNameNoImageState(t *testing.T) {
        t.Error("Incorrect image state retrieved by image name")
    }
}

func TestConcurrentRemoveUnusedImages(t *testing.T) {
    // NOTE: Test would fail without the corresponding fix


It's better to put this line into the description of the PR.

    }
    require.Equal(t, 1, len(imageManager.imageStates))

    err = imageManager.RemoveContainerReferenceFromImageState(container)


Can you add a comment here to explain why this is needed?

Contributor Author


This is part of the standard setup used by similar tests here; it helps trigger the use case under test.

Contributor


@vsiddharth Please add a comment in the code explaining this.

    }

    imageState, _ := imageManager.getImageState(imageInspected.ID)
    imageState.RemoveImageName(container.Image)


Also, why is this line required?


You can get rid of this line by changing client.EXPECT().RemoveImage(sourceImage.ImageID, removeImageTimeout).Return(nil) to client.EXPECT().RemoveImage(container.Image, removeImageTimeout).Return(nil), because the image manager will try to delete the image by name first; only if there is no name will it delete by ID.
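To illustrate the name-first, ID-fallback order described above, a small hypothetical helper (not the agent's actual code; the function name and signature are made up for illustration):

// Hypothetical helper: prefer removing by image name when one exists,
// otherwise fall back to removing by image ID.
func removalTarget(imageNames []string, imageID string) string {
    if len(imageNames) > 0 {
        return imageNames[0]
    }
    return imageID
}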

@@ -277,6 +275,8 @@ func (imageManager *dockerImageManager) performPeriodicImageCleanup(ctx context.
}

func (imageManager *dockerImageManager) removeUnusedImages() {
    imageManager.updateLock.Lock()
Contributor


Can you document if you've observed any slowness because of this in task launch latencies on an instance where there are at least 5 images to clean?

Contributor Author


I did not observe any significant increase in latencies.

Contributor

@aaithal, May 10, 2017


GitHub does not let me add comments on code that needs to be modified outside this PR, so I'm commenting here instead. It'd be a lot better if we could add some logging here, because this method acquires at least 3 locks during its lifetime. Specifically, a log line at line 351 (after the if block) to indicate that we are cleaning up all tracking information for the image name because it has 0 references.

Contributor

@samuelkarp left a comment


I think this mostly looks good, though it's hard to see the codepath in the diff (I had to open an editor and trace through it). Please address @richardpen's comment as well as mine.

    require.Equal(t, 1, len(imageManager.imageStates))

    err = imageManager.RemoveContainerReferenceFromImageState(container)
    if err != nil {
Contributor


Can you change this to assert.NoError?
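For reference, the suggested form with testify, assuming the assert package is imported alongside require in this test file:

    err = imageManager.RemoveContainerReferenceFromImageState(container)
    assert.NoError(t, err)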

    client.EXPECT().RemoveImage(container.Image, removeImageTimeout).Return(nil)
    require.Equal(t, 1, len(imageManager.imageStates))

    numRoutines := 1000
Contributor


Any particular reason you picked 1000?

Contributor Author


There isn't a significant reason for this choice. It's just something used to recreate the problem.

Contributor


Please add a comment to that effect. Right now 1000 is just a magic number in the source code.
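For illustration, a sketch of the goroutine fan-out pattern the test excerpt above relies on. The real test's mock client setup and assertions are elided, and it reuses the simplified manager type from the sketch under Implementation details:

package engine

import (
    "sync"
    "testing"
)

// Sketch only: hammer removeUnusedImages from many goroutines. Without the
// updateLock added in this PR, the unsynchronized map mutations would reliably
// crash the runtime ("concurrent map writes") or trip go test -race.
func TestConcurrentRemoveUnusedImages(t *testing.T) {
    imageManager := &dockerImageManager{imageStates: map[string]int{"sha256:example": 0}}

    // 1000 is arbitrary; it is simply large enough to reproduce the original
    // failure reliably, matching the numRoutines value in the excerpt above.
    numRoutines := 1000
    var waitGroup sync.WaitGroup
    waitGroup.Add(numRoutines)
    for i := 0; i < numRoutines; i++ {
        go func() {
            defer waitGroup.Done()
            imageManager.removeUnusedImages()
        }()
    }
    waitGroup.Wait()
}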

@samuelkarp
Contributor

@vsiddharth Do you plan to address the comments here? @richardpen still has one requested change and both @aaithal and I have asked questions.

@vsiddharth
Contributor Author

@richardpen @aaithal @samuelkarp I responded to most of the comments a few days ago. Please let me know if you have any further questions.

Thanks

@@ -348,6 +349,7 @@ func (imageManager *dockerImageManager) deleteImage(imageID string, imageState *
    seelog.Infof("Image removed: %v", imageID)
    imageState.RemoveImageName(imageID)
    if len(imageState.Image.Names) == 0 {
        seelog.Infof("Cleaning up all tracking information for image %v as it has zero references", imageID)
Contributor


nit: Please change %v to %s before merging this.

This patch fixes the concurrent map write issue reported in the image cleanup path. A unit test has been added to increase confidence in the fix.

Fixes aws#707

Signed-off-by: Vinothkumar Siddharth <sidvin@amazon.com>
@vsiddharth force-pushed the imageManager-concurrent-map-write branch from ac820c5 to b019dfb on May 23, 2017 22:12
@vsiddharth
Contributor Author

@samuelkarp Could you have another look at this PR?

@samuelkarp
Contributor

@vsiddharth What's failing in the integration tests on Windows?

@vsiddharth
Contributor Author

@samuelkarp I've looked at the Windows failures; they are due to a flaky test that we already track. The test usually succeeds on rerun or when run manually.

@adnxn
Contributor

adnxn commented May 24, 2017

@vsiddharth which test is it exactly?

@vsiddharth
Contributor Author

The test that failed was TestImageWithSameNameAndDifferentID, but this has already been addressed in the dev branch.

@adnxn
Contributor

adnxn commented May 26, 2017

Merged with this commit
