Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix the deadlock caused by the ImagePullDeleteLock #836

Merged
merged 3 commits into from
Jun 9, 2017

Conversation

richardpen
Copy link

@richardpen richardpen commented Jun 8, 2017

Summary

Fix a dead lock that was caused by the ImageCleanup and Image Pull. Issue #833

Implementation details

Based on the explanation here, adjust the order of lock acquiring in image cleanup, so that both go routines will require the lock in the same order, which wouldn't cause deadlock.

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -out amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass

New tests cover the changes:

Description for the changelog

Fix an issue which could potentially cause agent running into deadlock.

Licensing

This contribution is under the terms of the Apache 2.0 License:
yes

@vsiddharth
Copy link
Contributor

These changes should eliminate the deadlock we encountered with #833
LGTM

Copy link
Contributor

@nmeyerhans nmeyerhans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit: LGTM

Copy link
Contributor

@petderek petderek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

seelog.Debug("Attempting to obtain ImagePullDeleteLock for removing images")
ImagePullDeleteLock.Lock()
seelog.Debug("Obtained ImagePullDeleteLock for removing images")
defer seelog.Debug("Released ImagePullDeleteLock after removing images")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe group these statements in an anonymous function so its easier to see the order of execution?

defer func() { ... }

// Cause a fake delay when recording container reference so that the
// race condition between ImagePullLock and updateLock gets exercised
// If updateLock precedes ImagePullLock, it can cause a deadlock
client.EXPECT().InspectImage(sleepContainer.Image).Do(func(image string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool! Did we verify that this fails before the patch?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this test doesn't pass without the patch.

@richardpen richardpen merged commit c119166 into aws:dev Jun 9, 2017
@adnxn adnxn mentioned this pull request Jun 9, 2017
@samuelkarp samuelkarp added this to the 1.14.3 milestone Jun 9, 2017
@richardpen richardpen deleted the test-only branch October 17, 2017 18:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants