Modified pull image retry with 1100ms * 2^n strategy and rename Simpl… #1808

Merged
merged 1 commit into from
Jan 25, 2019

Conversation

Contributor

@suneyz suneyz commented Jan 23, 2019

Summary

  • Change retry interval for image pull from (250ms * 1.5^n) to (1.1s * 2^n) to make sure pull retry jumps over image pull throttle bucket for ECR images
  • Rename SimpleBackoff to ExponentialBackoff to align with what it actually does
  • Refactor retry+backoff into its own package

Implementation details

Testing

  • Builds on Linux (make release)
  • Builds on Windows (go build -o amazon-ecs-agent.exe ./agent)
  • Unit tests on Linux (make test) pass
  • Unit tests on Windows (go test -timeout=30s ./agent/...) pass
  • Integration tests on Linux (make run-integ-tests) pass
  • Integration tests on Windows (.\scripts\run-integ-tests.ps1) pass
  • Functional tests on Linux (make run-functional-tests) pass
  • Functional tests on Windows (.\scripts\run-functional-tests.ps1) pass

New tests cover the changes:

Description for the changelog

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@suneyz suneyz requested a review from a team January 23, 2019 01:26
@suneyz suneyz force-pushed the image_timeout branch 2 times, most recently from b4fe760 to 6a07251 Compare January 23, 2019 01:42
@suneyz suneyz removed the bot/test label Jan 23, 2019
pullRetryDelayMultiplier = 1.5
minimumPullRetryDelay = 1100 * time.Millisecond
maximumPullRetryDelay = 5 * time.Second
pullRetryDelayMultiplier = 2
Contributor

Are these to make the tests which depend on pulling images from ECR less flaky?

Contributor Author

Right, the longer interval makes sure that if a user gets throttled by ECR, the next retry will jump over a throttle bucket (1s). With the current setup, which starts at 0.25s, the user will be throttled again because the next retry is still within the same throttle bucket.

xPullRetryDelayMultiplier = 2
xPullRetryJitterMultiplier = 0.2
)

Contributor

Should these values match the ones in docker_client?

If not, how were they decided?

If so, should we keep a constant that can be reused in both places?

Contributor Author

I was deciding whether to increase the unit test timeout or to use a customized backoff for unit tests. I decided to use different backoff values for unit tests because using the same values as the docker client would make the unit tests in this file run for ~60s. I noticed that our tests already take quite long to run, so using customized values for unit tests might be a better option.

@@ -229,8 +230,8 @@ func (engine *DockerTaskEngine) MustInit(ctx context.Context) {
defer engine.mustInitLock.Unlock()

errorOnce := sync.Once{}
-taskEngineConnectBackoff := utils.NewSimpleBackoff(200*time.Millisecond, 2*time.Second, 0.20, 1.5)
-utils.RetryWithBackoff(taskEngineConnectBackoff, func() error {
+taskEngineConnectBackoff := retry.NewExponentialBackoff(200*time.Millisecond, 2*time.Second, 0.20, 1.5)
Contributor

Can we move the values here into constants as well? That would make it clearer what they are.

Contributor Author

Sure, makes sense :)

@@ -13,4 +13,4 @@ install:
build_script:
- go build ./agent
test_script:
- for /f "" %%G in ('go list github.com/aws/amazon-ecs-agent/agent/... ^| find /i /v "/vendor/"') do ( go test -race -tags unit -timeout 40s %%G & IF ERRORLEVEL == 1 EXIT 1)
- for /f "" %%G in ('go list github.com/aws/amazon-ecs-agent/agent/... ^| find /i /v "/vendor/"') do ( go test -race -tags unit -timeout 40s %%G & IF ERRORLEVEL == 1 EXIT 1)
Contributor

Should we increase the timeout to 60 here as well?

Contributor Author

I forgot to update the summary after I decided to use a customized backoff to shorten the runtime. I didn't go the route of lengthening the unit test timeout. The summary should be updated now.

Contributor

@shubham2892 shubham2892 left a comment

Left a few minor comments; looks good to me otherwise.

@suneyz suneyz added bot/test and removed bot/test labels Jan 24, 2019
@@ -0,0 +1,48 @@
package retry
Contributor

I think you need to add the copyright header to the new files in this package?

Contributor

And update the header of the modified files to xxx-2019, I guess.

Contributor Author

Good catch! Updated.

…eBackoffRetry to ExponentialBackoffRetry into a separate retry package
@suneyz suneyz added bot/test and removed bot/test labels Jan 25, 2019
@suneyz suneyz merged commit 0783e7f into aws:dev Jan 25, 2019
@suneyz suneyz deleted the image_timeout branch January 25, 2019 20:55
@suneyz suneyz added this to the 1.25.2 milestone Jan 30, 2019