separated logic for warm pools polling scenarios and do not fail on t… #3055

Merged
merged 1 commit into from
Oct 13, 2021

Conversation

lydiafilipe
Contributor

Commit: separated logic for warm pools polling scenarios and do not fail on throttling or transient server errors once state obtained

Summary

  • EC2 instances can theoretically stay in a warm pool indefinitely. These changes address the possibility that, at some point while polling, the agent is throttled or IMDS experiences issues. Neither should cause the agent to fail.
  • This changes the behavior so that once the target lifecycle state has been obtained, errors that are likely transient do not cause failure. This includes throttling and certain 5xx errors. However, before the state has been obtained, all errors except 404 errors will still cause failure.
  • The retry count for querying IMDS has also been increased from 3 to 5.

Implementation details

  • Separated polling for the first published value and the subsequent polling waiting for it to be in service.
  • Modified the waitUntilInstanceInService method to continue polling on certain errors once the target state has been obtained.
  • Added a separate method, pollUntilLifecycleStateObtained, for the initial polling for any value.

Testing

  • Ran unit tests with changes and added new tests
  • Ran code on EC2 and verified polling occurred in logs

Description for the changelog

Separated logic for warm pools polling scenarios and do not fail on throttling or transient server errors once state obtained

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

return err
}
// Poll while the instance is in a warmed state until it is going to go into service
for targetState != "InService" {
Contributor

nit: can we use inServiceState here?

// Poll while the instance is in a warmed state until it is going to go into service
for targetState != "InService" {
time.Sleep(pollWaitDuration)
targetState, err = agent.getTargetLifecycle(maxRetries)
Contributor

I think the retries make sense for the first stage (i.e. pollUntilTargetLifecyclePresent), but what do you think of invoking this function like agent.getTargetLifecycle(1) at this point, since this loop will retry anyway?

Contributor Author

Good point. The main advantage I would see is in the situation where we get an error on the API call just as the instance becomes ready, which would then delay the setup by a number of minutes. That would be an edge case, though I think it would not be a great experience if it were to occur, so I would be inclined to keep more than one retry.

That said, thinking about it, I don't think the increase in retry count is really necessary; it should be fine at 3. In the second stage we will retry anyway, and in the first stage we wouldn't expect throttling errors to be likely and wouldn't need additional retries in the failure scenarios.

…hrottling or transient server errors once state obtained
@lydiafilipe lydiafilipe merged commit 3445809 into aws:feature/warm_pools Oct 13, 2021
lydiafilipe added a commit to lydiafilipe/amazon-ecs-agent that referenced this pull request Oct 15, 2021
…hrottling or transient server errors once state obtained (aws#3055)

Co-authored-by: Lydia Filipe <fillydia@amazon.com>
lydiafilipe added a commit to lydiafilipe/amazon-ecs-agent that referenced this pull request Nov 17, 2021
…hrottling or transient server errors once state obtained (aws#3055)

Co-authored-by: Lydia Filipe <fillydia@amazon.com>
@lydiafilipe lydiafilipe mentioned this pull request Feb 2, 2022

4 participants