
fix(cli): CLI hangs for 10 minutes on expired credentials #21052

Merged
7 commits merged into main from huijbers/no-retry-expired on Aug 19, 2022

Conversation

@rix0rrr (Contributor) commented Jul 8, 2022

When using environment variable credentials (`AWS_ACCESS_KEY_ID` etc.)
that were expired, the CLI would proceed to retry calls involving those
credentials, because the `ExpiredToken` error is marked as `retryable: true`.

Because we have extremely aggressive retries for most of our SDK calls
(since the CloudFormation throttling limits are low and we generate a
lot of contention on them), calls can take up to 10 minutes to run out
of retries.

Try to detect `ExpiredToken` situations sooner and error out harder,
without trying to recover from them.

This PR only handles the situation where there are Roles to assume --
this works because calls to STS have a much lower retry count, so it
only takes a couple of seconds to run out of retries and surface the
`ExpiredToken` to the CLI, which we can then use to abort early.

This is all to work around aws/aws-sdk-js#3581

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
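(For illustration only: a minimal sketch, not the code from this PR, of the fail-fast idea described above. It probes the credentials with a low-retry STS call and aborts on `ExpiredToken` instead of letting the long CloudFormation retry loop play out; the function name and error message here are assumptions.)

```ts
// Hypothetical sketch using the AWS SDK for JavaScript v2.
import * as AWS from 'aws-sdk';

async function assertCredentialsUsable(): Promise<void> {
  // Give STS a low retry count so an expired token surfaces in seconds, not minutes.
  const sts = new AWS.STS({ maxRetries: 1 });
  try {
    await sts.getCallerIdentity().promise();
  } catch (e: any) {
    // The SDK marks ExpiredToken as retryable, but retrying cannot fix expired
    // static credentials, so fail hard and fast instead of trying to recover.
    if (e.code === 'ExpiredToken' || e.code === 'ExpiredTokenException') {
      throw new Error(`AWS credentials have expired, please refresh them (${e.message})`);
    }
    throw e;
  }
}
```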

@rix0rrr rix0rrr requested a review from a team July 8, 2022 08:27
@rix0rrr rix0rrr self-assigned this Jul 8, 2022
@github-actions github-actions bot added the p2 label Jul 8, 2022
@aws-cdk-automation aws-cdk-automation requested a review from a team July 8, 2022 08:27
@mergify mergify bot added the contribution/core This is a PR that came from AWS. label Jul 8, 2022
@corymhall (Contributor) left a comment


LGTM!

My only question: the PR description seems to imply that there are other places where this logic should be implemented (you mention CFN calls as an issue). Is the 10-minute hang only caused by these STS calls? If we obtain valid credentials, start a deployment, and then they expire, is there still an issue?

It might be helpful to indicate in the description what is left out of this change and why (too hard to address? will be addressed in a follow-up PR?).

@corymhall corymhall added the pr/do-not-merge This PR should not be merged at this time. label Jul 8, 2022
@rix0rrr (Contributor, Author) commented Jul 8, 2022

> Is the 10-minute hang only caused by these STS calls?

No, in fact it is caused by CFN calls. But right now what typically happens is: our STS calls fail, we decide to recover anyway (which we do for backwards compatibility, the thinking being that the credentials are probably for the right account anyway), and then we proceed to call CFN, which hangs.

If we don't try to do STS AssumeRole calls, we skip the part where we call STS and we immediately go to CFN, which will hang.

> If we obtain valid credentials, start a deployment, and then they expire, is there still an issue?

This is just generally unrecoverable anyway. Nothing to be done about it.


You did give me an idea though: right now we start by calling `GetCallerIdentity` to get the account number for the current credentials, and we have a cache on that to not call it unnecessarily. If we remove the cache, we'll always have a way to check the validity of the current credentials -- at least, at the start of the operation.

It'll cause more calls to STS... but at this point I'm not even sure that's something we should care about.
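(As a rough illustration of the cache being discussed, a minimal in-memory sketch with assumed names; the CDK's actual cache is more involved than this.)

```ts
// Hypothetical sketch: remember the account id from STS GetCallerIdentity so
// repeated operations don't have to re-resolve it. Dropping the memoization
// would re-validate the credentials on every operation, at the cost of extra
// STS calls.
import * as AWS from 'aws-sdk';

let cachedAccountId: string | undefined;

export async function defaultAccountId(): Promise<string | undefined> {
  if (cachedAccountId === undefined) {
    const sts = new AWS.STS({ maxRetries: 1 });
    const response = await sts.getCallerIdentity().promise();
    cachedAccountId = response.Account;
  }
  return cachedAccountId;
}
```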

@corymhall (Contributor):

> ...and we have a cache on that to not call it unnecessarily. If we remove the cache, we'll always have a way to check the validity of the current credentials -- at least, at the start of the operation.

ha I started writing a review comment asking why we don't just throw the error here and then I realized it's because there is a cache and deleted the comment 🤦‍♂️

@rix0rrr (Contributor, Author) commented Aug 12, 2022

We can't take out the cache; it would also affect `cdk synth` operations, not just `cdk deploy`.

@rix0rrr rix0rrr removed the pr/do-not-merge This PR should not be merged at this time. label Aug 19, 2022
@mergify bot commented Aug 19, 2022

Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

@aws-cdk-automation (Collaborator):

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: 30553b9
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

@mergify mergify bot merged commit 1e305e6 into main Aug 19, 2022
@mergify mergify bot deleted the huijbers/no-retry-expired branch August 19, 2022 13:14

josephedward pushed a commit to josephedward/aws-cdk that referenced this pull request Aug 30, 2022
rix0rrr added a commit that referenced this pull request Nov 10, 2022
#21052 tried to fix the situation
where we would keep on doing retries if AWS credentials were expired.

However, this is now failing too hard for people that commonly have
expired credentials in their environment but still want to have
`cdk synth` complete successfully.

Catch and swallow the error (but do complain with a warning) if we
encounter an `ExpiredToken` during the `defaultAccount` operation.
That's the only place where it's used, and the only place where the
value is optional -- it behaves the same as if no credentials were
configured.

Also in this PR: add some TypeScript decorators to trace through a bunch
of async method calls to come up with a reasonable trace of where errors
originate. Not complete, not intended to be. But it is a nice basis
for debugging SDK call behavior, and can be used more in the future.
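(A hedged sketch of both ideas in this follow-up: swallow `ExpiredToken` during the default-account lookup with a warning, and a tracing decorator for async methods. Class, method, and decorator names here are illustrative assumptions, not the CDK's actual code.)

```ts
// Hypothetical sketch only. Requires "experimentalDecorators": true in tsconfig.json.
import * as AWS from 'aws-sdk';

// A tracing decorator in the spirit of the ones mentioned above: it prefixes
// errors thrown from an async method with the class and method name, so the
// final error message reads like a trace of where the failure originated.
function traceCall(target: any, propertyKey: string, descriptor: PropertyDescriptor): PropertyDescriptor {
  const original = descriptor.value;
  descriptor.value = async function (this: any, ...args: any[]) {
    try {
      return await original.apply(this, args);
    } catch (e: any) {
      e.message = `${target.constructor.name}.${propertyKey}: ${e.message}`;
      throw e;
    }
  };
  return descriptor;
}

class CredentialsProvider {
  // Swallow ExpiredToken during the default-account lookup so `cdk synth` can
  // still complete, but warn so the user knows why the account is unknown.
  @traceCall
  public async defaultAccount(): Promise<string | undefined> {
    try {
      const sts = new AWS.STS({ maxRetries: 1 });
      const response = await sts.getCallerIdentity().promise();
      return response.Account;
    } catch (e: any) {
      if (e.code === 'ExpiredToken' || e.code === 'ExpiredTokenException') {
        console.warn(`Could not determine the default account (${e.message}); behaving as if no credentials were configured`);
        return undefined;
      }
      throw e; // anything else is still a hard error
    }
  }
}
```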
mergify bot pushed a commit that referenced this pull request Nov 10, 2022
Labels: contribution/core (This is a PR that came from AWS), p2