
AWS Pod identities need to be reviewed #4134

Closed
Tracked by #5275
JorTurFer opened this issue Jan 18, 2023 · 27 comments · Fixed by #5061
Labels
bug Something isn't working stale-bot-ignore All issues that should not be automatically closed by our stale bot

Comments

@JorTurFer
Member

JorTurFer commented Jan 18, 2023

Report

It's mandatory to define serviceAccountName in the scaled workload because some code reads the awsRoleArn even though it isn't always necessary for this authentication (the SDK reads the AWS role from the KEDA service account).

My proposal is to deprecate identityOwner in favor of a new option in TriggerAuthentication (as we have with Azure pod identity).

aws-kiam should be removed, as it has been deprecated in favor of aws-eks. We should deprecate it and remove it in KEDA v2.12.

@JorTurFer JorTurFer added the bug Something isn't working label Jan 18, 2023
@JorTurFer JorTurFer self-assigned this Jan 18, 2023
@blakepettersson

It would still be useful if we could assume other roles from the KEDA operator. Could we have a roleArn attribute somewhere in TriggerAuthentication, which the KEDA operator could then assume?

@JorTurFer
Member Author

Do you mean something like this in Azure?
[image]

We did it to prevent cases where the same identity (a role, in AWS terms) stacks a lot of permissions; with it you can federate multiple identities with the same k8s service account, and KEDA takes one or another based on the TriggerAuthentication.

It should be doable if you meant that. I'm not 100% sure, but it's just a matter of giving it a try.

Are you willing to contribute?

@blakepettersson

I'm not familiar with Azure, but I think that's the same concept. The KEDA service account would have an IAM role with a policy that allows it to take on (assume) other roles, like the example below.

{
    "Version": "2012-10-17",
    "Statement": [{
        "Action": "sts:AssumeRole",
        "Effect": "Allow",
        "Resource": "arn:aws:iam::*:role/grant-keda-access"
      }]
}

If we were to have an awsRoleArn in the TriggerAuthentication CRD, or by using the awsRoleArn that's available in the various scalers, the KEDA operator would then assume the given role, as long as the appropriate assume-role policy has been granted by the awsRoleArn (something like this in Terraform syntax):

resource "aws_iam_role" "give-keda-access" {
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Action = "sts:AssumeRole"
        Effect = "Allow"
        Principal = {
          AWS = "arn:aws:iam::$my-account-id:role/my-keda-operator-role"
        }
      },
    ]
  })
  name = "grant-keda-access"
}

I took a look at the code and it seems like this should work, but I never really got it to work without doing what you were describing (defining a service account name etc., which seems unnecessary). It seems like it should be enough to remove all references to identityOwner and to modify getAwsConfig to be something like this:

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/credentials/stscreds"
	"github.com/aws/aws-sdk-go/aws/session"
)

func getAwsConfig(awsRegion string, awsEndpoint string, awsAuthorization awsAuthorizationMetadata) (*session.Session, *aws.Config) {
	metadata := &awsConfigMetadata{
		awsRegion:        awsRegion,
		awsEndpoint:      awsEndpoint,
		awsAuthorization: awsAuthorization}

	sess := session.Must(session.NewSession(&aws.Config{
		Region:   aws.String(metadata.awsRegion),
		Endpoint: aws.String(metadata.awsEndpoint),
	}))

	var creds *credentials.Credentials

	// If a role ARN is set, assume it via STS; otherwise fall back to static
	// credentials when an access key pair is set. If neither is set, creds
	// stays nil and the SDK uses its default credential chain.
	if metadata.awsAuthorization.awsRoleArn != "" {
		creds = stscreds.NewCredentials(sess, metadata.awsAuthorization.awsRoleArn)
	} else if metadata.awsAuthorization.awsAccessKeyID != "" && metadata.awsAuthorization.awsSecretAccessKey != "" {
		creds = credentials.NewStaticCredentials(metadata.awsAuthorization.awsAccessKeyID, metadata.awsAuthorization.awsSecretAccessKey, "")
	}

	return sess, &aws.Config{
		Region:      aws.String(metadata.awsRegion),
		Endpoint:    aws.String(metadata.awsEndpoint),
		Credentials: creds,
	}
}

I'd be willing to take a deeper look into this and send a PR if I get this to work 😄

@JorTurFer
Member Author

JorTurFer commented Jan 19, 2023

Let's wait before removing identityOwner, because maybe I'm wrong and it works somehow (I'm not an expert in AWS); there is a user in Slack who has it working: https://kubernetes.slack.com/archives/CKZJ36A5D/p1674015373425479

@JorTurFer
Member Author

Okay, I have discovered how it works:
when identityOwner is empty or identityOwner=pod, it uses the ARN from the workload's service account instead of environment variables from the KEDA pod. The implementation is quite confusing and forces some things, but it works.
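
To make the behaviour concrete, here is a minimal sketch of the decision described above. This is not KEDA's actual code; the function and parameter names are illustrative assumptions.

```go
package main

import "fmt"

// resolveRoleArn illustrates the discovered behaviour: when identityOwner is
// empty or "pod", the role ARN comes from the scaled workload's service
// account; otherwise KEDA falls back to its own identity (represented here
// by an empty string, meaning the SDK uses KEDA's own credentials).
func resolveRoleArn(identityOwner, workloadRoleArn string) string {
	if identityOwner == "" || identityOwner == "pod" {
		return workloadRoleArn // use the workload's service account role
	}
	return "" // empty: the SDK uses KEDA's own credentials
}

func main() {
	fmt.Println(resolveRoleArn("pod", "arn:aws:iam::123456789012:role/workload"))
	fmt.Println(resolveRoleArn("operator", "arn:aws:iam::123456789012:role/workload"))
}
```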

@tomkerkhove tomkerkhove moved this from Proposed to To Triage in Roadmap - KEDA Core Feb 16, 2023
@JorTurFer JorTurFer moved this from To Triage to To Do in Roadmap - KEDA Core Mar 13, 2023
@stale

stale bot commented Mar 21, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale All issues that are marked as stale due to inactivity label Mar 21, 2023
@JorTurFer JorTurFer added stale-bot-ignore All issues that should not be automatically closed by our stale bot and removed stale All issues that are marked as stale due to inactivity labels Mar 21, 2023
@JorTurFer JorTurFer removed their assignment Mar 21, 2023
@JorTurFer
Member Author

I have been thinking about this, and I believe that creating a new authentication provider for AWS, with a good design, providing the same capabilities Azure has, is a good option. Once the new one is ready, I'd deprecate the other two (they are a bit chaotic and complex to maintain).
WDYT @kedacore/keda-contributors ?

@tomkerkhove
Member

I'm fine if we are sure that the potentially new provider works perfectly and has full parity

@JorTurFer
Member Author

I'm fine if we are sure that the potentially new provider works perfectly and has full parity

The problem is that the current providers are really poorly documented, and they definitely have some weird configurations (adding e2e tests for them was horrible because of that). That's why I'd redesign the AWS identity in KEDA based on Azure identity (which is really nice IMO), documenting it properly.

@jeevanragula
Contributor

We are also facing some issues, and the authentication using PodIdentity is quite confusing.
We noticed that even after specifying podIdentity.provider as aws-eks, KEDA is trying to assume the role using the node role. Ideally it should use the WebIdentity associated with the deployment pod.

  Warning  KEDAScalerFailed  24s (x6 over 2m54s)  keda-operator  (combined from similar events): AccessDenied: User: arn:aws:sts::1111122222:assumed-role/dev_node_group_role/i-0d8345ca0020 is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::1111122222:role/sqs-event-processor
           status code: 403, request id: 2b1a303f-95c4-4ad0-874c-16c705f3ae0c

@JorTurFer
Member Author

We noticed that even after specifying podIdentity.provider as aws-eks, KEDA is trying to assume the role using the node role. Ideally it should use the WebIdentity associated with the deployment pod.

Have you updated the KEDA deployment with the needed information? https://github.com/kedacore/charts/blob/main/keda/values.yaml#L210
If you enable pod identity without assigning an identity to the deployment (you have to restart it too so the new pod gets mutated), it will try to use the node role. Could you check if the KEDA pod has been correctly mutated with the required envs?

What KEDA version are you using?

@ghost

ghost commented Oct 6, 2023

Using podIdentity.provider: aws-eks means the KEDA service account will have the following type of annotation:

apiVersion: v1
automountServiceAccountToken: true
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/audience: sts.amazonaws.com
    eks.amazonaws.com/role-arn: arn:aws:iam::214272731565:role/example-async-dev-ew1-queuescaling-role

That means that, using aws-eks, a cluster will only work with one IAM role. This becomes a problem for a shared cluster where multiple tenants use different IAM roles, and currently it's not possible to attach multiple IAM roles as annotations to the KEDA service account.

Do we have any workaround for this?

@JorTurFer
Member Author

Do we have any workaround for this?

Currently no (or at least not a well-documented one); that's why this issue is open :)
In the Slack threads there is a workaround for that given by a user.

@JorTurFer
Member Author

JorTurFer commented Oct 9, 2023

I think it's time to tackle this issue from our side because there haven't been willing contributors xD
Let me make a quick proposal for the new podIdentity (hoping to be clearer and easier). The proposal is:

  • introduce a new podIdentity
  • if the TriggerAuthentication doesn't override the role, KEDA will use its own role for the scaler
  • if the TriggerAuthentication does override the role, KEDA will assume the workload role and use it for the scaler
  • this decision will be made based on whether or not another role is set within the TriggerAuthentication (no other fields will be required in scaler metadata)
  • this behaviour will be covered (IDK how yet) with e2e tests
  • deprecate and eventually remove aws-eks and aws-kiam
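
A sketch of the proposed decision path (the type, field, and function names here are my assumptions for illustration, not a final API):

```go
package main

import "fmt"

// triggerAuthSpec is a hypothetical slice of the proposed TriggerAuthentication:
// an optional RoleArn that, when set, overrides KEDA's own role.
type triggerAuthSpec struct {
	RoleArn string
}

// roleForScaler implements the proposal above: no override means KEDA uses
// its own role; an override means KEDA assumes that role for the scaler.
func roleForScaler(ta triggerAuthSpec, kedaOwnRole string) string {
	if ta.RoleArn != "" {
		return ta.RoleArn // KEDA assumes the workload role
	}
	return kedaOwnRole // KEDA uses its own role
}

func main() {
	fmt.Println(roleForScaler(triggerAuthSpec{}, "arn:aws:iam::111122223333:role/keda-operator"))
	fmt.Println(roleForScaler(triggerAuthSpec{RoleArn: "arn:aws:iam::111122223333:role/workload"}, "arn:aws:iam::111122223333:role/keda-operator"))
}
```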

@tomkerkhove @zroubalik ?

@tomkerkhove
Member

Why do we need to deprecate aws-eks and introduce a new one? I'm not sure I get that.

@JorTurFer
Member Author

JorTurFer commented Oct 10, 2023

aws-kiam has to be deprecated/removed because the product itself has been deprecated and abandoned in favor of aws-eks.

IMHO, aws-eks should be deprecated too because there are some important problems:

  • Missing documentation and e2e tests
  • The feature is coupled across different "layers": you have to specify it in TriggerAuthentication but also in scalers. This is too different from the approach we follow, for example, in Azure podIdentity. Consistency is always important.
  • The code is (at least to me) confusing and complex to follow (and maintain).
  • The name suggests it only works on EKS, but it actually works on any cluster.

But! After working on this a bit, I think I have discovered how it works (I'm writing e2e tests for the new podIdentity draft), and maybe we could just document properly how aws-eks works, mark it as deprecated, and keep it there until the next major release.
But in any case, I'd add the new podIdentity, unifying the behaviour with Azure podIdentities and moving the podIdentity logic from the scalers to the TriggerAuthentication.

@zroubalik
Member

I agree with the approach proposed by @JorTurFer. If there's a problem with deprecation, we can keep the current one around as long as needed.

Great job!

@tomkerkhove
Member

aws-kiam has to be deprecated/removed because the product itself has been deprecated and abandoned in favor of aws-eks.

Definitely fine for me!

IMHO, aws-eks should be deprecated too because there are some important problems:

  • Missing documentation and e2e tests
  • The feature is coupled across different "layers": you have to specify it in TriggerAuthentication but also in scalers. This is too different from the approach we follow, for example, in Azure podIdentity. Consistency is always important.
  • The code is (at least to me) confusing and complex to follow (and maintain).
  • The name suggests it only works on EKS, but it actually works on any cluster.

But these are not really reasons why we should deprecate it though? It's just in a bad state.

But! After working on this a bit, I think I have discovered how it works (I'm writing e2e tests for the new podIdentity draft), and maybe we could just document properly how aws-eks works, mark it as deprecated, and keep it there until the next major release. But in any case, I'd add the new podIdentity, unifying the behaviour with Azure podIdentities and moving the podIdentity logic from the scalers to the TriggerAuthentication.

What is this new pod identity then? I'm not sure I get the full proposal here :)

@JorTurFer
Member Author

But these are not really reasons why we should deprecate it though? It's just in a bad state.

I don't agree with that. Based on our deprecation policies, we have few options to remove things, and that generates technical debt. The current approach is totally coupled to the trigger metadata, and we can't change it based on our deprecation policies. It's not just a single reason, it's the whole list together: poor docs, e2e test gaps, overly complex code, an alignment gap with Azure podIdentity, and a confusing name (EKS is not IRSA; AWS IRSA means roles, AWS EKS means k8s, so aws-eks suggests something for EKS and not for IRSA).

We can't address all of them; that's why IMHO redesigning it totally, testing it, and documenting it is better than adding more patches and increasing technical debt.

We could introduce the new one and deprecate the others. In this case, I agree with keeping aws-eks there as deprecated, without new features. This helps us because instead of introducing a breaking change at some point (nvm if it's during a major or minor version), we will have warned users and they will probably have migrated to the new one.

@JorTurFer
Member Author

JorTurFer commented Oct 10, 2023

I'm worried about keeping any kind of code just to not break something, because the technical debt grows and it can be a problem... Undocumented + untested code is potentially technical debt, even more in our case, where random contributors add the code and we never see them again.

For example: in the GCP podIdentity, the contributor was going to introduce the same coupled approach (without it being needed, because in GCP we don't allow using other identities and their code didn't do it), just because they saw it in already existing code.

@JorTurFer
Member Author

JorTurFer commented Oct 10, 2023

That said, if I'm the only one who thinks this is required to improve things in general (code, standardization, etc.), we can just add some e2e tests to the current approach and close this issue (deprecating & removing the deprecated aws-kiam; I think we've agreed on this).

I may be missing some important point, or maybe my opinion is influenced by my knowledge gap in AWS.

@zroubalik
Member

That said, if I'm the only one who thinks this is required to improve things in general (code, standardization, etc.), we can just add some e2e tests to the current approach and close this issue (deprecating & removing the deprecated aws-kiam; I think we've agreed on this).

I agree with you, improvement is needed for sure.

@tomkerkhove
Member

What I'm not fully getting is that we want to deprecate 2 trigger auth types but introduce a new one. So how will this new one be better? Did they announce a new authentication type, or is it simply because we want to rework it so that it's better maintained and we just don't like aws-eks as a name?

@tomkerkhove
Member

I'm not saying we can't deprecate them both, just trying to understand the reasoning for aws-eks and what this new thing is

@JorTurFer
Member Author

JorTurFer commented Oct 10, 2023

Honestly, the only reason to deprecate the old one (let's talk about aws-eks only, because the cause behind the aws-kiam deprecation is just the upstream deprecation in favor of IRSA; nothing to do there) is that I see it as easier to introduce all the required changes with no impact for users xD

I mean, I'd agree with extending the current aws-eks with the new changes in TriggerAuthentication if we were going to deprecate and remove the trigger-configuration part in a reasonable time (2-3 versions). If we just extend the current aws-eks with the new options, we will create more confusion for users (more options for doing the same thing), and we will introduce more complexity into the code, because we will need to take the precedence into account to not break something.

On the other hand, by creating a new authentication, we have the freedom to design it as we want, without having to be fully compatible, because it's a different authentication (based on the same? Yes, but a totally different "entity"). From a coding POV, this will be much better for adding, but also for removing the old code. From a user's POV, this is a migration they have to do explicitly, and that's why I also agree with maintaining aws-eks as deprecated alongside the new one until the next major version, helping users with the migration path.

Summarizing, the only reason for adding a new auth and deprecating the previous one is to avoid trouble with breaking changes and deprecation policies, as removing the field from the scaler auth section and adding it as a first-class property of the TA is a breaking change, and personally I'm against supporting both places in the same auth to make our lives happier xD

@tomkerkhove
Member

I think I'm not fully grasping it, still - sorry :)

Does that mean we currently have it defined on the scalers instead of the trigger authentication? If not, what would the new trigger authentication look like compared to what we have today?

I think deprecating an existing trigger authentication just to change things might be a bit aggressive and cause end-user frustration we can avoid; but I am probably missing something. I want to avoid cognitive load on end-users

@tomkerkhove
Member

Just synced up with Zbynek and if we're doing this to move it from trigger metadata to auth then I'm fine with it
