Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Agent to be more resilient in case of unauthenticated timeouts with IMDS #3795

Merged
merged 1 commit into from
Jul 13, 2023

Conversation

hozkaya2000
Copy link
Contributor

@hozkaya2000 hozkaya2000 commented Jul 11, 2023

Summary

There is an edge case where the credentials within the agent used to access internal resources can be incorrect due to other processes restarting during a certain time interval. This can cause the agent to think that the credentials are not expired and correct. The change is to prevent this from happening, and refreshing the credentials in the first location this issue would appear, which would be RegisterContainerInstance in the ecsclient.

Implementation details

The issue occurs in the RegisterContainerInstance function in ecsclient's client.go. RegisterContainerInstance makes the call to IMDS in SetInstanceIdentity function where the client for communicating with IMDS is used with the call client.ec2metadata.GetDynamicData(...). The new IMDS process does not work with the current credentials.

In the fix, this line is replaced with a retry strategy. When we get an error code, the current credentials in agent are forced before the retry to be expired, and a GetCredentials call is made to refresh the credentials, with multiple retries if it fails again. The global instance credentials in the agent are refreshed with this call, so other calls made by agent to IMDS are no longer an issue.

Testing

To test, a new case was added to the RegisterContainerInstanceTest that is the exact same as the basic success case, but the first call to client.ec2metadata.GetDynamicData(...) was simulated to fail, and the retry call to succeed. The rest of the test validations remained, making sure that retries didn't prevent any other process from happening.

New tests cover the changes: yes

Description for the changelog

Files:

  • agent/api/ecsclient/client.go
  • agent/api/ecsclient/client_test.go

were changed as described in the implementation and testing details

Licensing

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@hozkaya2000 hozkaya2000 requested a review from a team as a code owner July 11, 2023 22:53
@hozkaya2000 hozkaya2000 changed the title bug fix for https://sim.amazon.com/issues/Agent-8841 Update Agent to be more resilient in case of 401 unauthenticated timeouts with IMDS (and potentially other cases) Jul 11, 2023
@hozkaya2000 hozkaya2000 changed the title Update Agent to be more resilient in case of 401 unauthenticated timeouts with IMDS (and potentially other cases) Update Agent to be more resilient in case of unauthenticated timeouts with IMDS Jul 11, 2023
Copy link
Contributor

@Realmonia Realmonia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please force push to remove the internal link from commit message.

agent/api/ecsclient/client.go Outdated Show resolved Hide resolved
agent/api/ecsclient/client.go Outdated Show resolved Hide resolved
agent/api/ecsclient/client.go Outdated Show resolved Hide resolved
@hozkaya2000 hozkaya2000 merged commit 255ed43 into aws:dev Jul 13, 2023
6 checks passed
@hozkaya2000 hozkaya2000 deleted the branch_pull_behavior branch July 13, 2023 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants