engine: add inactivity timeout for image pulling #1290

Merged: 1 commit, Mar 20, 2018

Contributor

@fenxiong fenxiong commented Mar 12, 2018

Summary

This pull request addresses #1249 by introducing an InactivityTimeout for image pulling.

Implementation details

The change uses the inactivity timeout parameter supported by go-dockerclient.

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -o amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=25s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass

New tests cover the changes: yes

Three manual tests were performed as follows, with the timeout set to 5 seconds:
(1) Start pulling a large image and send SIGSTOP to the Docker daemon process to see whether the inactivity timeout is triggered; then send SIGCONT to the Docker daemon process to see whether pulling resumes.
Result: after SIGSTOP, the timeout is triggered; after SIGCONT, pulling resumes.

(2) Start pulling a large image (from ECR) and blackhole ECR to see whether the timeout is triggered; then remove the blackhole to see whether pulling resumes.
Result: after the blackhole is added, the pull fails immediately with a connection failure, so it never reaches the timeout limit.

(3) Start pulling a large image and randomly drop 50% of packets on the host during the pull.
Result: the inactivity timeout is not triggered; the pull succeeds normally.

Description for the changelog

Enhancement - Introduce InactivityTimeout to image pulling.

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@adnxn adnxn added this to the 1.18.0 milestone Mar 12, 2018
@@ -83,6 +83,10 @@ const (
// around a docker bug which sometimes results in pulls not progressing.
dockerPullBeginTimeout = 5 * time.Minute

// dockerPullInactivityTimeout is the amount of time that we will
// wait when the pulling does not progress
dockerPullInactivityTimeout = 5 * time.Minute
Contributor

how did we decide on 5 mins? is there any data/metrics available for us? Ideally p99 would be a good value to use.

Contributor Author

currently it's set arbitrarily. i'm not sure if there's any data/metrics for this

Contributor

@adnxn adnxn Mar 12, 2018

@yunhee-l, hm, we haven't figured out a good way to measure p99 in this case. the only real visibility we've had so far is from root-causing hanging tasks/containers, which get stuck for arbitrary periods of time. we want to introduce this as a fail-fast for docker inactivity, at least during image pulls.

do you know if we could get a better idea of what to set here from the ecr side? 😕

Contributor

From the ECR side, P99 is in seconds, not minutes (I can show you the dashboard in person), so 5 mins is more than long enough as a timeout. I thought 5 mins was a bit too long in general, but if it's only meant to catch hanging tasks I suppose it is not a big issue. However, the flip side is whether this might adversely impact large image pulls that can take longer than 5 mins? Is ECR the only repo we pull from?

Contributor

However, the flip side is whether this might adversely impact large image pulls that can take longer than 5 mins?

The dockerPullInactivityTimeout shouldn't adversely affect large image pulls. It is explicitly a check that times out if docker hangs during the download, and large image pulls should show continuous activity while downloading.

Is ECR the only repo we pull from?

No, the agent could be pulling from anywhere.

Contributor

thanks for the explanation! if timeout is specifically for the case where docker is hanging and inactive, then I don't think any of my concerns are relevant.

Contributor Author

just as a check, i tested it manually with a large image (3GB) and small timeout (5 secs), and the pulling worked fine with no timeout.

Contributor

5m seems too long here. I don't think this should be greater than 1m.

Contributor Author

The timeout is now set to 1 minute. I have commented on the SageMaker ticket to see if there's any input on their side.

Contributor

aaithal commented Mar 13, 2018

@fenxiong can you please incorporate the following tests for this as well?

  1. Start pulling a large image, blackhole ECR/Dockerhub (wherever the image is being pulled from), ensure that this timeout works
  2. Start pulling a large image, send SIGSTOP to Docker daemon process, ensure that timeout works
  3. Same as [1] or [2], but remove the blackhole/send SIGCONT and ensure that pull resumes
  4. Randomly drop 50% of packets on the host, ensure that pull inactivity timeout behavior works

@fenxiong
Contributor Author

@aaithal I have finished the above tests, with the timeout set to 5 seconds. For the Docker daemon test, I was able to observe the timeout and to resume the pull afterwards. For the blackhole test, I was not able to observe the timeout, because the pull failed immediately after I added the IP to the blackhole and never reached the timeout. For the random drop case, I saw no timeout, and I think this is expected, right? As long as the other 50% of packets still get through, the pull keeps making progress.

docker.ErrInactivityTimeout).Times(maximumPullRetries) // expected number of retries

metadata := client.PullImage("image", nil)
if metadata.Error == nil {

Please use the assert instead. Same below.

@fenxiong fenxiong force-pushed the inactivity-pull branch 4 times, most recently from 8712e0d to b953fba Compare March 14, 2018 17:54
Contributor

@adnxn adnxn left a comment

change lgtm, thanks for the detailed testing report in the PR summary 😁

@fenxiong fenxiong merged commit 3b211fc into aws:dev Mar 20, 2018
@aaithal aaithal modified the milestones: 1.18.0, 1.17.3 Mar 21, 2018
Contributor

aaithal commented Mar 21, 2018

@fenxiong can you please make sure to submit a new PR that modifies CHANGELOG.md file with a reference to this PR?

Thanks,
Anirudh

fenxiong added a commit to fenxiong/amazon-ecs-agent that referenced this pull request Mar 21, 2018
@fenxiong fenxiong mentioned this pull request Mar 21, 2018
8 tasks
fenxiong added a commit to fenxiong/amazon-ecs-agent that referenced this pull request Mar 21, 2018
fenxiong added a commit to fenxiong/amazon-ecs-agent that referenced this pull request Mar 21, 2018
fenxiong added a commit to fenxiong/amazon-ecs-agent that referenced this pull request Mar 22, 2018
fenxiong added a commit that referenced this pull request Mar 22, 2018
5 participants