Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: Enhance podIdentity Role Assumption in AWS by Direct Integration with OIDC/Federation and Implement Credentials Cache #5177

Closed
wants to merge 4 commits into from

Conversation

ThaSami
Copy link
Contributor

@ThaSami ThaSami commented Nov 10, 2023

In the existing setup, when AWS roles are overridden through pod identity mechanism or by specifying the awsRoleArn trigger authentication environment variable, KEDA attempts to assume the specified role using its Operator's IAM role. This process necessitates granting the KEDA operator's role the necessary permissions to assume the target role. Additionally, the trust relationship on the target role must be configured to permit this assumption. While this approach aligns well with setups using kube2iam, it introduces a redundant step in environments that utilize IRSA.

The proposed Pull Request (PR) enables KEDA to directly assume AWS roles by using OpenID Connect (OIDC) and federation mechanisms. This direct method bypasses the need to configure additional permissions on the KEDA operator's role. By doing so, it simplifies the role assumption process in IRSA environments, reducing the configuration overhead typically associated with setting up cross-account role permissions and trust policies.
The modification maintains backward compatibility; if role assumption is unsuccessful, it will revert to attempting role assumption via the operator's role

Additionally, this update introduces a credentials cache for AWS roles. When multiple ScaledObjects utilize the same AWS role, this cache significantly cuts down on the number of AWS API calls.

Checklist

  • Tests have been added
  • Changelog has been updated and is aligned with our changelog requirements
  • A PR is opened to update the documentation on (repo) (if applicable)
  • Commits are signed with Developer Certificate of Origin (DCO - learn more)

Fixes #5178
Fixes #5297

@ThaSami ThaSami requested a review from a team as a code owner November 10, 2023 18:41
Copy link

Thank you for your contribution! 🙏 We will review your PR as soon as possible.

While you are waiting, make sure to:

Learn more about:

Signed-off-by: Sami S <25616506+ThaSami@users.noreply.github.com>
Signed-off-by: Sami S <25616506+ThaSami@users.noreply.github.com>
@blakepettersson
Copy link

I think in general this makes sense - perhaps it would be good to showcase how other operators (e.g external-secrets) uses STS, or to link to some reference documentation on how STS is meant to be used by a k8s operator to showcase to people not so familiar with AWS that this is indeed the recommended path to go.

This also ties in with the work @JorTurFer is doing in #5061 and kedacore/keda-docs#1251 so would be good to get his eyes on this as well.

Otherwise it would be good to add some E2E tests for this, but I'm not familiar with how this is done in KEDA so that would also need @JorTurFer's input 😄

@ThaSami
Copy link
Contributor Author

ThaSami commented Nov 23, 2023

Thank you, @blakepettersson, for highlighting the external-secrets approach.
Their method is interesting they provide 3 ways to auth:

  1. Utilizing the operator's IAM role, the same as what we do in KEDA. This includes the option for the controller's role to assume another role via sts:AssumeRole. This is closely related to what we're addressing in the current PR, which aims to modify this.
  2. Employing AWS secret keys.
  3. Leveraging service account tokens, provided that volume projection is enabled — a setting activated by default in EKS environments.

The third option catches my interest. @JorTurFer, it might be worth considering this into the new AWS provider.

While I still believe the work being done in this PR holds some value, we could shift our approach a little bit. Instead of retrieving KEDA's service account token via os.GetEnv(WEB_IDENTITY_TOKEN), we might use External-secret's approach method to dynamically generate the other workload's service account token.
The external-secrets team shows how to generate credentials given the service account's name, as seen in their implementation: Token Fetcher in external-secrets.

@JorTurFer
Copy link
Member

JorTurFer commented Nov 27, 2023

While I still believe the work being done in this PR holds some value, we could shift our approach a little bit. Instead of retrieving KEDA's service account token via os.GetEnv(WEB_IDENTITY_TOKEN), we might use External-secret's approach method to dynamically generate the other workload's service account token.

Honestly, I don't think that this is worth because KEDA could create a token for any identity within the cluster, and that's a huge risk IMHO. I didn't know that external-secrets does this, but now it's a no-go for me. Adding the capability of requesting service account tokens without any limitation is potentially risky because KEDA could request a token for a service account bound with a role that it's not allowed for KEDA, but as KEDA can just request a valid k8s token for that workload, KEDA could by-pass the assumption policy and escalate its privileges. This approach also would require a huge privilege increase in KEDA side, because KEDA needs this permission by RBAC. I strongly disagree with this, but maybe I'm missing something important. @zroubalik WDYT?
I'd like to add also that this approach has been copied from the aws implementation from secrets store CSI and I also think that it's terrible. Their RBAC allows this, so the controller can fake any service account in the cluster without restrictions (including privilege escalation of the component itself, requesting service account token from other workloads with more privileges). This isn't in k8s repos but in aws organization.

About the approach, I have thoughts in conflict with doing the fallback from NewWebIdentityRoleProvider to NewAssumeRoleProvider because it can produce confusion to users, and I'd just select one of them and document it as way to go, but I don't have any strong opinion about doing it or not (if it's well documented). In any case, if we have consensus about doing or not the fallback, and if we agree with not doing it, having consensus about using WebIdentity or not, I can just update my PR to reflect these agreements.

@zroubalik
Copy link
Member

I am with @JorTurFer on this, though to be fair, I don't consider myself an expert on the area.

@JorTurFer
Copy link
Member

JorTurFer commented Dec 7, 2023

Just for update this PR, @ThaSami and I have been talking about directly via slack and we will move this to the new aws auth (the one I have as wip), adding to it some extra features, so the new auth will include also:

  • The fallback from assumeRoleWithWebIdentity to AssumeRole, documenting it well. We agreed on this because SDKs already have a default option that does exactly this, so it's not totally confusing
  • We need to create a session cache on top of SDK CredentialsCache to reduce the amount of requested tokens. This is because currently, each scaler generates it's own token, 100 scalers means 100 tokens although they are for the same ARN. The token for a given ARN has to be shared accross all the scalers that uses the ARN

These to features are in addition to already pending changes in that auth

@zroubalik
Copy link
Member

Just for update this PR, @ThaSami and I have been talking about directly via slack and we will move this to the new aws auth (the one I have as wip), adding to it some extra features, so the new auth will include also:

* The fallback from assumeRoleWithWebIdentity  to AssumeRole, documenting it well. We agreed on this because SDKs already have a default option that does exactly this, so it's not totally confusing

* We need to create a session cache on top of SDK CredentialsCache  to reduce the amount of requested tokens. This is because currently, each scaler generates it's own token, 100 scalers means 100 tokens although they are for the same ARN. The token for a given ARN has to be shared accross all the scalers that uses the ARN

These to features are in addition to already pending changes in that auth

Excellent, I really like the cache!

@ThaSami ThaSami changed the title Enhance podIdentity Role Assumption in AWS by Direct Integration with OIDC/Federation Feat: Enhance podIdentity Role Assumption in AWS by Direct Integration with OIDC/Federation and Implement Credentials Cache Dec 17, 2023
@ThaSami
Copy link
Contributor Author

ThaSami commented Dec 17, 2023

Thanks, @JorTurFer @zroubalik for the insights

Following our conversation on Slack with Jorge, I've implemented a cache for AWS Role credentials. This cache monitors each AWS role's usage and performs cleanup only when a scaler is removed. I would appreciate your input.

@JorTurFer
Copy link
Member

Thanks for your contribution! As we have talked about offline, I close this PR in favor of #5061 which adds this functionality to the new auth.

@JorTurFer JorTurFer closed this Dec 31, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants