Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set fake AWS credentials on controller to workaround aws-sdk bug #4073

Merged
merged 1 commit into from Jul 12, 2021
Merged

Set fake AWS credentials on controller to workaround aws-sdk bug #4073

merged 1 commit into from Jul 12, 2021

Conversation

ghost
Copy link

@ghost ghost commented Jul 2, 2021

Workaround for issue #4087 while we look into better solutions.

Changes

Several issues have now reared their head which are directly caused
by an update to the aws-sdk. The update results in extremely long
delays in the execution of tasks after the Pipelines controller is
first deployed in a cluster. The aws-sdk is initialized through
a transitive dependency that pipelines pulls in via k8schain or go-containerregistry.

Here are the recent issues directly related to this bug:

One quick way to work around this problem is to set fake AWS
credentials in the environment of the deployed controller. This
apparently causes the aws-sdk to skip whatever process it has
introduced that causes massive delays in execution. So this commit
does exactly that - set fake aws creds in the deployments env vars.

This is an unfortunate hack to try and mitigate the problem until
a better solution presents itself. Ideally go-containerregistry or
k8s-pkg-credentialprovider would provide a way for us to disable
AWS SDK initialization to avoid this misbehaviour.

/kind misc

Submitter Checklist

As the author of this PR, please check off the items in this checklist:

Release Notes

Work around a bug in the AWS go SDK that causes extremely long delays in task startup times.

@tekton-robot tekton-robot added release-note-none Denotes a PR that doesnt merit a release note. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/misc Categorizes issue or PR as a miscellaneuous one. labels Jul 2, 2021
@tekton-robot tekton-robot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jul 2, 2021
@ghost ghost closed this Jul 2, 2021
@ghost ghost reopened this Jul 9, 2021
Several issues have now reared their head which are directly caused
by an update to the aws-sdk. The update results in extremely long
delays in the execution of tasks after the Pipelines controller is
first deployed in a cluster. The aws-sdk is initialized through
a transitive dependency that pipelines pulls in via k8schain.

Here are the recent issues directly related to this aws-sdk bug:

- #3627 (Since December!)
- #4084

One quick way to work around this problem is to set fake AWS
credentials in the environment of the deployed controller. This
apparently causes the aws-sdk to skip whatever process it has
introduced that causes massive delays in execution. So this commit
does exactly that - set fake aws creds in the deployments env vars.

This is an unfortunate hack to try and mitigate the problem until
a better solution presents itself.
@tekton-robot tekton-robot added size/S Denotes a PR that changes 10-29 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jul 9, 2021
@ghost ghost changed the title Temp commit, just seeing if adding these env vars speeds up our PR CI Set fake AWS credentials on controller to workaround aws-sdk bug Jul 9, 2021
@tekton-robot tekton-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesnt merit a release note. labels Jul 9, 2021
@ghost
Copy link
Author

ghost commented Jul 9, 2021

/hold cancel

@tekton-robot tekton-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 9, 2021
Copy link
Collaborator

@bobcatfish bobcatfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for figuring this out @sbwsg ! This seems fine as a short term fix but I think we should at least track this somewhere so we can come back to it (i think you might be working on this already but id like to track it anyway just in case)

im also wondering what (if any) implication this might have for folks who are actually using AWS credentials - might be worth asking if someone using AWS is willing to try this out and make sure it doesnt cause any issues for them

/approve

@@ -93,6 +93,16 @@ spec:
valueFrom:
fieldRef:
fieldPath: metadata.namespace
# These phony AWS credentials are here to work around a bug in the aws go sdk
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add an issue to track this and mention it in the comment? (either here or in go-containerregistry) 🙏

basically just trying to make sure when someone stumbles on this years from now cuz they need to change it that they know where to go to see what the latest state is (i mean they can always track it through the blame but since this is kind of a hack workaround, it feels like we'd like to maybe pursue a long term solution at some point)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, we should track this in its own parent issue and probably open or identify an aws-sdk-go issue that's at the bottom of this turtle stack.

@tekton-robot
Copy link
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bobcatfish

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 9, 2021
@imjasonh
Copy link
Member

imjasonh commented Jul 9, 2021

+1 on making this change as a short-term stopgap. Another route to fixing this that we should explore:

If aws-sdk-go has fixed the bug since v1.31.12 (which is the version we depend on, a year+ old), we should try to see if this bug goes away just by upgrading dependencies.

We currently depend on k8schain from January; we could upgrade to the latest release, v0.5.1, which depends on k8s-pkg-credentialprovider v1.21.0-1, which depends on aws-sdk-go v1.35.24 (November 2020)

The latest aws-sdk-go is v1.39.4 (two hours ago), we could see if upgrading that dependency then upgrading k8schain then upgrading tektoncd/pipeline fixes the bug.

@ghost
Copy link
Author

ghost commented Jul 12, 2021

@bobcatfish @imjasonh I've created #4087 to capture the problem with all the notes I've been piecing together.

@ghost
Copy link
Author

ghost commented Jul 12, 2021

Updating the SDK may well help indeed, have just seen that aws/aws-sdk-go#3066 from back in January reduces a timeout in requests to the ec2 metadata service. I am super confused how we are seeing timeout times of ~75 seconds per connection attempt while their timeout configuration only had it at 5 seconds prior to that PR though. I might not be fully grokking what's actually going wrong here.

@afrittoli
Copy link
Member

@imjasonh @sbwsg I'd be happy to merge that asap even as a temporary workaround - so that it might still get into the next release or minor release - wdyt? @pritidesai

@pritidesai
Copy link
Member

this is really phony but as long as its a temporary workaround, I am fine with this, thanks a bunch @sbwsg 🙏

/lgtm

@tekton-robot tekton-robot added the lgtm Indicates that a PR is ready to be merged. label Jul 12, 2021
@tekton-robot tekton-robot merged commit 49a7fa2 into tektoncd:main Jul 12, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/misc Categorizes issue or PR as a miscellaneuous one. lgtm Indicates that a PR is ready to be merged. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants