Webhook not injecting Env Vars or Volumes/VolumeMounts on initial deployment to a new cluster #174

Open
ahuffman opened this issue Feb 7, 2023 · 6 comments

Comments


ahuffman commented Feb 7, 2023

What happened:
I am deploying multiple services on my cluster, such as cluster-autoscaler, external-dns, and the EBS CSI driver. On initial deployment, the pods do not receive the injected environment variables, volumes, and volumeMounts.

When I manually delete the affected pods after the automated deployment, the replacement pods get everything from the webhook as expected.

I've followed every available AWS document on troubleshooting IRSA. I initially thought it could be a race condition right after cluster instantiation, but I tested delaying the deployments by as long as 10 minutes and the result is the same.

What you expected to happen:
Environment variables, volumes, and volumeMounts are injected into the deployment's pod specs without needing to manually delete the pods.

How to reproduce it (as minimally and precisely as possible):
  1. Create an EKS cluster.
  2. Create an IAM OIDC provider for the cluster.
  3. Create an IAM policy and an IAM role, and attach the policy to the role.
  4. Add a trust relationship to the role referencing the OIDC provider and the Kubernetes service account.
  5. Do a Helm release with values specifying the corresponding namespace, service account name, and the annotation carrying the role ARN to tie it all together (see the sketch after this list).
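
For concreteness, step 5 boils down to the standard IRSA annotation on the ServiceAccount. A minimal sketch, assuming placeholder names (my-namespace, my-app, and the role ARN are hypothetical):

```sh
# Placeholder namespace, ServiceAccount, and role ARN; this is the
# annotation that ties the ServiceAccount to the IAM role for IRSA.
kubectl -n my-namespace annotate serviceaccount my-app \
  eks.amazonaws.com/role-arn=arn:aws:iam::111122223333:role/my-app-role
```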

Anything else we need to know?:
Not that I can think of, but feel free to ask for more :).

Environment:

  • AWS Region: us-east-1 (have tested in many with same result)
  • EKS Platform version (if using EKS, run aws eks describe-cluster --name <name> --query cluster.platformVersion): eks.3
  • Kubernetes version (if using EKS, run aws eks describe-cluster --name <name> --query cluster.version): 1.24 (also tested 1.23 with same result)
  • Webhook version: unknown; whatever ships with EKS 1.24.8
@mohammadasim

I am experiencing the same issue. Our cluster is a kOps-managed cluster. We deployed a service that had two replicas and noticed that one of the pods was able to access the S3 bucket but the other wasn't. On investigation, the pod that was not able to access the bucket didn't have the AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE environment variables, while the pod that could access the bucket had them. I checked that both pods had the same service account. When I deleted the pod with the missing environment variables, the new pod created in its place had both variables.
My Kubernetes version is v1.22.5. The amazon-eks-pod-identity-webhook version is as follows:
Image: amazon/amazon-eks-pod-identity-webhook:latest
Image ID: docker-pullable://amazon/amazon-eks-pod-identity-webhook@sha256:4a3ff337b6549dd29a06451945e40ba3385729c879f09f264d3e37d06cbf001a
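
For reference, this is roughly how to check which pods were mutated (my-pod is a placeholder name):

```sh
# my-pod is a placeholder. If the webhook mutated the pod, both variables
# show up in the container spec; otherwise grep prints nothing.
kubectl get pod my-pod -o yaml \
  | grep -E 'AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE'
```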
Any information will be highly appreciated.


cjellick commented Mar 3, 2023

This looks very similar to an issue we are hitting. Here was our conclusion (courtesy of @StrongMonkey):

Digging deep, this is caused by a race condition in the pod identity webhook when the ServiceAccount and the pod are created at the same time: the webhook uses a cache to fetch the ServiceAccount, and the cache entry might not be ready yet when the pod is created.

@jsilverio22 should be able to share a WIP PR soon.


cjellick commented Mar 3, 2023

In our case, we are probably exacerbating the problem by:

  1. creating the Deployment before we create the ServiceAccount (it's programmatic and very close together, so there isn't a long delay, but maybe just enough for a race; see the sketch after this list)
  2. using this webhook in k3s, which might make it worse because in an HA k3s setup watch events can be delayed by 2 seconds
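
Concretely, the ordering in point 1 looks something like this (manifest file names are placeholders):

```sh
# Placeholder manifests. Applying the Deployment first opens a window in
# which pods are admitted before the webhook's cache sees the ServiceAccount.
kubectl apply -f deployment.yaml
kubectl apply -f serviceaccount.yaml
```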


ahuffman (issue author) commented Mar 3, 2023

In my case, I'm running on EKS, but I'm performing my deployments via Helm charts, where the ServiceAccounts and their related annotations are created from the chart values at the same time as the workloads.

I also tried pre-provisioning the ServiceAccount with the annotations, but it did not change the behavior, which led me to believe it was something else altogether.

In a similar situation to @cjellick, my entire cluster provisioning process is done programmatically via Crossplane. Just to reiterate, I do not believe it's a problem with my configuration, because when I delete the pods after their first instantiation everything works fine, but the pods from the initial deployment do not pick up their IRSA privileges.

@ekristen

I'm seeing this problem on simple re-deployments, too. For example, when a Deployment recreates its pods (e.g. because they move to other nodes), the environment variables simply aren't put in place. I'm running the pod identity webhook on multiple nodes, so there should always be one instance online to respond.


rlister commented Feb 22, 2024

This is a blocker for us to adopt Pod Identity (as a switch from IRSA). We create the IAM Role (using the ACK controller), PodIdentityAssociation, ServiceAccount, and Deployment in the same Helm chart. On initial install, pods come up without the identity mutation, winning a race we want them to lose. This occurs even when hardcoding the roleARN in the PodIdentityAssociation, i.e. we are not waiting for status on the Role.

Installing the PodIdentityAssociation in a Helm pre-install hook does not help: the resource comes up and is ready very quickly, but our pods still come up before the mutation is ready.

Waiting a minute and restarting the Deployment gives us new pods with the correct mutations (see the command below).
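
For anyone else hitting this, the restart workaround is just a rollout restart (my-app is a placeholder name):

```sh
# my-app is a placeholder. Restarting the Deployment recreates the pods,
# and the replacements come up with the expected mutations.
kubectl rollout restart deployment/my-app
```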
