fix(cli): CLI hangs for 10 minutes on expired credentials #21052
When using environment variable credentials (`AWS_ACCESS_KEY_ID` etc.) that were expired, the CLI would proceed to retry calls involving those credentials because the `ExpiredToken` error is marked as `retryable: true`.

Because we have extremely aggressive retries for most of our SDK calls (since the CloudFormation throttling limits are low and we generate a lot of contention on them), calls can take up to 10 minutes to run out of retries.

Try to detect `ExpiredToken` situations sooner and error out harder, without trying to recover from them.

This PR only handles the situation where there is a role to assume. This works because calls to STS have a much lower retry count, so it only takes a couple of seconds to run out of retries and surface the `ExpiredToken` to the CLI, which we can then use to abort early.

This is all to work around aws/aws-sdk-js#3581.
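As an illustration of the "abort early" behavior described above, here is a minimal sketch. It is not the code from this PR. It assumes a hypothetical `assumeRole` callback that wraps the STS AssumeRole call, a simplified `Credentials` shape, and errors that carry a `code` property. The fallback branch stands in for the backwards-compatible recovery; the `ExpiredToken` branch rethrows so the caller can stop before any heavily retried calls are made.

```typescript
// Hypothetical credential shape; the real CLI works with AWS SDK credential objects.
interface Credentials {
  accessKeyId: string;
  secretAccessKey: string;
  sessionToken?: string;
}

// True if the error carries the `ExpiredToken` code named in the PR description.
function isExpiredToken(e: unknown): boolean {
  return typeof e === 'object' && e !== null
    && (e as { code?: unknown }).code === 'ExpiredToken';
}

// Try to assume a role. On most failures, fall back to the current credentials
// (the backwards-compatible recovery). On ExpiredToken, rethrow so the caller aborts.
async function credentialsForRole(
  assumeRole: () => Promise<Credentials>, // hypothetical: wraps the STS AssumeRole call
  current: Credentials,
  warn: (msg: string) => void,
): Promise<Credentials> {
  try {
    return await assumeRole();
  } catch (e) {
    if (isExpiredToken(e)) {
      // Expired credentials cannot be recovered from: surface the error immediately.
      throw e;
    }
    warn(`Could not assume role, proceeding with current credentials: ${String(e)}`);
    return current;
  }
}
```

The warning sink is passed in as a parameter so the sketch stays independent of any particular logging setup.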
LGTM!
My only question is that the PR description seems to imply that there are other places where this logic should be implemented (you mention CFN calls as an issue). Is the 10-minute hang only caused by these STS calls? If we obtain valid credentials, start a deployment, and then they expire, is there still an issue?
It might be helpful to indicate in the description what is left out of this and why (too hard to address? will be addressed in a follow up PR?).
No, in fact it is caused by CFN calls. But right now what typically happens is: our STS calls fail, we decide to recover (which we do for backwards compatibility, "credentials look for the right account anyway"), and then we proceed to call CFN, which hangs. If we don't try to do STS AssumeRole calls, we skip the part where we call STS and immediately go to CFN, which will hang.
This is just generally unrecoverable anyway. Nothing to be done about it. You did give me an idea though: right now we start by calling It'll cause more calls to STS... but at this point I'm not even sure that's something we should care about.
Ha, I started writing a review comment asking why we don't just throw the error here and then I realized it's because there is a cache and deleted the comment 🤦‍♂️
We can't take out the cache, it would also affect
Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).
#21052 tried to fix the situation where we would keep on doing retries if AWS credentials were expired. However, this is now failing too hard for people that commonly have expired credentials in their environment but still want to have `cdk synth` complete successfully. Catch and swallow the error (but do complain with a warning) if we encounter an `ExpiredToken` during the `defaultAccount` operation. That's the only place where it's used, and the only place where the value is optional -- it behaves the same as if no credentials were configured. Also in this PR: add some TypeScript decorators to trace through a bunch of async method calls to come up with a reasonable trace of where errors originate. Not complete, not intended to be. But it is a nice basis for debugging SDK call behavior, and can be used more in the future.
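The "catch and swallow, but warn" behavior described for `defaultAccount` could look roughly like the sketch below. This is an illustration under stated assumptions, not the code from the follow-up PR: `lookupAccount` is a hypothetical callback standing in for whatever call resolves the account ID, the warning text is invented, and rethrowing all other errors is an assumption of this sketch.

```typescript
// True if the error carries the given code (errors are assumed to expose `code`).
function hasErrorCode(e: unknown, code: string): boolean {
  return typeof e === 'object' && e !== null
    && (e as { code?: unknown }).code === code;
}

// Resolve the default account, treating expired credentials like missing ones.
async function defaultAccount(
  lookupAccount: () => Promise<string>, // hypothetical: resolves the current account ID
  warn: (msg: string) => void,
): Promise<string | undefined> {
  try {
    return await lookupAccount();
  } catch (e) {
    if (hasErrorCode(e, 'ExpiredToken')) {
      // Complain, but behave as if no credentials were configured.
      warn('Expired AWS credentials found in the environment; continuing without a default account.');
      return undefined;
    }
    // Assumption of this sketch: any other error is propagated unchanged.
    throw e;
  }
}
```

Because the return type is `string | undefined`, callers that treat the account as optional (such as a plain `cdk synth`) can continue, while callers that need an account still have to handle the missing value.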