
AWS nodes become permanently unreachable after updating aws-auth ConfigMap #1847

Closed
iridian-ks opened this issue Dec 23, 2021 · 5 comments · Fixed by #1926
Labels
  • impact/usability: Something that impacts users' ability to use the product easily and intuitively
  • kind/bug: Some behavior is incorrect or out of spec
  • resolution/fixed: This issue was fixed
Milestone
0.70
Comments

@iridian-ks

Hello!

  • Vote on this issue by adding a 👍 reaction
  • To contribute a fix for this issue, leave a comment (and link to your pull request, if you've opened one already)

Issue details

I'm managing k8s worker nodes in a PCI zone. We manage firewall rules (Palo Alto) by EC2 tags, which means each PCI-zoned workload needs to be scheduled on the right nodes. I've automated provisioning different EKS node groups with Pulumi, and each node group gets its own dedicated IAM role as part of this.

In EKS, for nodes to join the cluster, their IAM role needs to be listed in the aws-auth ConfigMap. When a new PCI workload comes along, I append its node groups to the list and run Pulumi, which does everything perfectly in spinning up the new node group in the EKS cluster.
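For context, the aws-auth ConfigMap maps IAM roles to Kubernetes identities so that nodes assuming those roles can register with the cluster. A minimal sketch of one mapRoles entry (the account ID and role name below are placeholders, not values from this issue):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    # One entry per node-group IAM role; nodes whose role is missing
    # here cannot join (or re-join) the cluster.
    - rolearn: arn:aws:iam::111122223333:role/pci-node-group-role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
```

Deleting this ConfigMap, even briefly, removes every role mapping at once, which is why a delete-then-create update can knock existing node groups out.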

The problem is that part of this updates aws-auth. Pulumi only supports deleting and then creating ConfigMaps, which is certainly best practice for almost all use cases, and this bug is really an AWS issue. But when Pulumi deletes and re-creates the ConfigMap, the existing NodeGroup is permanently stuck as unschedulable and I need to re-create the entire NodeGroup.

What Pulumi could offer is an immutable field in ConfigMapArgs that defaults to True, keeping the current behavior, but letting individual users set it to False if they encounter a use case like this one.
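As a point of reference, the Kubernetes API already defines a native `immutable` field on ConfigMaps (GA since v1.21), which a provider could plausibly use as exactly this hint. A sketch of what the suggested opt-out might look like at the manifest level:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
# false = treat as mutable: update in place instead of delete-and-recreate.
# The proposal above would default this to true to preserve current behavior.
immutable: false
data:
  mapRoles: |
    ...
```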

Steps to reproduce

  1. Create an EKS cluster
  2. Create a node group with a dedicated IAM role and an aws-auth ConfigMap that allows it to join the cluster
  3. Create a new node group in Pulumi with a new IAM role and update the ConfigMap to include both IAM roles

Expected: Both the new and existing node groups are in the EKS cluster
Actual: Existing node groups can no longer join the EKS cluster
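The steps above might be sketched in a Pulumi program roughly as follows. This is a non-runnable illustration, assuming the @pulumi/eks package; all resource names are hypothetical and not taken from the reporter's code:

```typescript
import * as aws from "@pulumi/aws";
import * as eks from "@pulumi/eks";

// Step 2: a node group with its own dedicated IAM role.
const roleA = new aws.iam.Role("node-group-a-role", {
    assumeRolePolicy: aws.iam.assumeRolePolicyForPrincipal({
        Service: "ec2.amazonaws.com",
    }),
});

// Step 1: the cluster; instanceRoles feeds the aws-auth ConfigMap.
const cluster = new eks.Cluster("pci-cluster", {
    skipDefaultNodeGroup: true,
    instanceRoles: [roleA],
});

const groupA = new eks.NodeGroup("node-group-a", {
    cluster: cluster,
    instanceProfile: new aws.iam.InstanceProfile("a-profile", { role: roleA }),
});

// Step 3: adding a second role to instanceRoles and a second NodeGroup
// triggers an aws-auth update; the provider's delete-then-create of that
// ConfigMap is what leaves group A unable to rejoin.
```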

@iridian-ks iridian-ks added the kind/bug Some behavior is incorrect or out of spec label Dec 23, 2021
@ruckc

ruckc commented Dec 23, 2021

I've seen this when the updated aws-auth map doesn't contain the account creating the EKS cluster.

@iridian-ks
Author

iridian-ks commented Dec 23, 2021

I don't think there's an issue with the ConfigMap itself. I re-create the NodeGroup after it breaks with a pulumi up --target-replace urn=...NodeGroup..., which deletes all the nodes, re-creates them, and then everything works again.

This tells me that everything is fine except for a window in which the kubelet can't join. I imagine it's either a kubelet or an EKS issue, but it's avoidable if Pulumi doesn't delete the ConfigMap.

Apologies if I'm not fully understanding.

@viveklak
Contributor

What Pulumi could offer is an immutable field in the ConfigMapArgs that defaults to True to keep with the current behavior but allow individual users to decide whether or not to set it to False if they are encountering a use-case like this one.

It does appear we don't currently take the immutable field on the ConfigMap as a hint to override the replace logic. cc @lblackstone for thoughts here.

@viveklak viveklak added the impact/usability Something that impacts users' ability to use the product easily and intuitively label Dec 28, 2021
@lblackstone
Member

These links are related:
#1568 (comment)
#1775

To summarize, we're planning to use the replaceOnChanges resource option to make the replace behavior user-configurable rather than embedding that logic in the provider. This should give you the flexibility required to make this work.
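Under that plan, replacement would become an opt-in per-resource decision. A sketch of what opting a ConfigMap back into replacement could look like, assuming the provider stops forcing replacement itself (replaceOnChanges is an existing Pulumi resource option; the data value here is a placeholder):

```typescript
import * as k8s from "@pulumi/kubernetes";

const awsAuth = new k8s.core.v1.ConfigMap("aws-auth", {
    metadata: { name: "aws-auth", namespace: "kube-system" },
    data: { mapRoles: "..." },
}, {
    // Resource option: force delete-and-recreate whenever `data` changes.
    // Omitting it would leave updates in place, which is what this issue needs.
    replaceOnChanges: ["data"],
});
```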

@viveklak
Contributor

While pulumi/pulumi#9158 is still open, we are adding a new provider config key in v3.17.0 to treat configmaps as immutable by default.
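The comment doesn't name the config key, so the key below is an assumption on my part; confirm it against the pulumi-kubernetes v3.17.0 release notes before relying on it:

```shell
# Assumed key name (not stated in this thread): opt the provider into
# treating ConfigMaps as mutable, so updates patch in place.
pulumi config set kubernetes:enableConfigMapMutable true
```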

@lukehoban lukehoban added this to the 0.70 milestone Mar 20, 2022