EKS Node Group fails to recreate when using launch template, on minor template update #1152

Closed
jcam opened this issue Dec 19, 2020 · 24 comments · Fixed by #1372

Comments


jcam commented Dec 19, 2020

I have issues (while true, that is separate from this bug report ;) )

I'm submitting a...

  • [ X ] bug report

What is the current behavior?

After creating an EKS node group with a launch template, updating the template causes a "NodeGroup already exists" failure.

If this is a bug, how to reproduce? Please include a code sample if relevant.

Create an EKS node group with a custom launch template, using custom userdata

[...]
  node_groups = {
    worker = {
      desired_capacity        = 2
      max_capacity            = 6
      min_capacity            = 1
      instance_type           = "r5.xlarge"
      additional_tags         = local.common_tags
      launch_template_id      = aws_launch_template.default.id
      launch_template_version = aws_launch_template.default.default_version
    }
  }
[...]
resource "aws_launch_template" "default" {
  name_prefix            = "${var.cluster_name}"
  description            = "Default Launch-Template for clusters"
  update_default_version = true
  key_name               = data.aws_ssm_parameter.keypair.value
  user_data              = base64encode(replace(data.cloudinit_config.node_group.rendered, "MIMEBOUNDARY", "//"))
  ebs_optimized          = true
  instance_type          = "r5.xlarge"
[...]
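
For completeness, a minimal sketch of what the referenced cloudinit_config data source might look like; the script path and part content here are assumptions, not taken from the actual configuration:

data "cloudinit_config" "node_group" {
  gzip          = false
  base64_encode = false

  # Hypothetical userdata part; the real configuration renders custom
  # userdata, which is then base64-encoded into the launch template above.
  part {
    content_type = "text/x-shellscript"
    content      = file("${path.module}/userdata.sh")
  }
}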

Update the user data being consumed by the launch template.
On apply, Terraform will instruct AWS to update the launch template in place, creating a new version of the template.

Terraform will also instruct AWS to create a new node group using the updated launch template version, but it will fail with this error:

Error: error creating EKS Node Group (uccdev-west:uccdev-west-worker-magnetic-puma): ResourceInUseException: NodeGroup already exists with name uccdev-west-worker-magnetic-puma and cluster name uccdev-west

This is caused by the configuration of the random_pet for the node_group, which currently does not have the launch template version as one of its keepers.

What's the expected behavior?

The random_pet name should be updated, a new node group should be created, and EKS should migrate all workloads to the new node group before shutting down the old one.

Are you able to fix this problem and submit a PR? Link here if you have already.

In this file: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/modules/node_groups/random.tf#L20
Add the following as line 21:

launch_template_version = lookup(each.value, "launch_template_version", null)
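
For context, a minimal sketch of roughly where that keeper would sit in the random_pet resource; the surrounding attributes are assumptions, not a copy of the module source:

resource "random_pet" "node_groups" {
  for_each = local.node_groups_expanded

  keepers = {
    # ...existing keepers from the module elided...

    # Proposed addition: generate a new pet name (and therefore a new
    # node group name) whenever the launch template version changes.
    launch_template_version = lookup(each.value, "launch_template_version", null)
  }
}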

Environment details

  • Affected module version: 13.2.1
  • OS: MacOS 10.15.7
  • Terraform version: v0.14.0

Any other relevant info


jcam commented Dec 20, 2020

After updating the userdata again, I looked back and saw that the actual trigger of the replacement was a change in the instance type:

  # module.eks.module.node_groups.aws_eks_node_group.workers["worker"] must be replaced
+/- resource "aws_eks_node_group" "workers" {
      ~ ami_type        = "AL2_x86_64" -> (known after apply)
      ~ arn             = "arn:aws:eks:us-west-2:363865594249:nodegroup/uccdev-west/uccdev-west-worker-magnetic-puma/08bb3e64-9672-c685-3444-0e776cd0a017" -> (known after apply)
      ~ capacity_type   = "ON_DEMAND" -> (known after apply)
      ~ disk_size       = 0 -> (known after apply)
      ~ id              = "uccdev-west:uccdev-west-worker-magnetic-puma" -> (known after apply)
      ~ instance_types  = [ # forces replacement
          - "t3.medium",
        ]
      - labels          = {} -> null
      ~ release_version = "1.18.9-20201211" -> (known after apply)
      ~ resources       = [
          - {
              - autoscaling_groups              = [
                  - {
                      - name = "eks-08bb3e64-9672-c685-3444-0e776cd0a017"
                    },
                ]
              - remote_access_security_group_id = ""
            },
        ] -> (known after apply)
      ~ status          = "ACTIVE" -> (known after apply)
        tags            = {
            "Application"      = "EKS"
            "Environment"      = "UCCDEV"
            "GitRepoName"      = "sre-eks-terraform"
            "GitRepoPath"      = "terraform/eks-cluster"
            "ManagementMethod" = "tfstack-v0.44.1"
            "Name"             = "eks-cluster"
            "Owner"            = "jcampbell@mms.org"
            "Product"          = "SRE-Infrastructure"
            "Purpose"          = "Kubernetes Container Hosting"
            "Region"           = "us-west-2"
        }
      ~ version         = "1.18" -> (known after apply)
        # (4 unchanged attributes hidden)

      ~ launch_template {
            id      = "lt-024139de7893b8632"
          ~ name    = "uccdev-west-20201219201305265300000001" -> (known after apply)
          ~ version = "5" -> "6"
        }

        # (1 unchanged block hidden)
    }

Plan: 1 to add, 0 to change, 1 to destroy.

@surajnarwade

Hi @jcam, I am facing the same issue with managed node groups. I tried adding:

launch_template_version = lookup(each.value, "launch_template_version", null)

as you mentioned, but it's not working; it's still recreating the node group for me.


pre commented Jan 20, 2021

I'm failing to understand the correct behaviour expected by the original description in this issue (above).

In my mind, the Managed Node Group must not be recreated when the Launch Template is updated.

The Managed Node Group is using a certain version of a Launch Template. When that Launch Template is updated, a new version of the same Launch Template will be available (same Launch Template id).

As a result, the existing Managed Node Group can be updated to use the new version of the same Launch Template.
Updating the version number bound to a certain Managed Node Group may happen at the time when the Launch Template is updated or later when the bound version is changed (as a separate action).

The crucial point is: The Managed Node Group should NOT be recreated when a Launch Template is updated.

Instead, the new version of the same Launch Template will be made available to the Managed Node Group, but it is a separate decision to update the Managed Node Group to use this new Launch Template version.
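
To illustrate that in-place path, here is a rough sketch with the raw aws_eks_node_group resource (placeholder names, not this module's interface); changing launch_template.version is an update of the existing node group, not a replacement:

resource "aws_eks_node_group" "example" {
  cluster_name    = "my-cluster"
  node_group_name = "example"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.subnet_ids

  scaling_config {
    desired_size = 2
    max_size     = 6
    min_size     = 1
  }

  launch_template {
    id = aws_launch_template.default.id

    # Pin to a specific version and bump it deliberately, or reference
    # aws_launch_template.default.latest_version to roll nodes whenever
    # the template changes; either way the node group itself is updated
    # in place rather than destroyed and recreated.
    version = aws_launch_template.default.default_version
  }
}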


pre commented Jan 20, 2021

Btw, check out the following #1109 (comment)

You might be experiencing the same issue, but having different symptoms.

TL;DR avoid random_pet


jcam commented Jan 20, 2021

I agree the existing node group should be updated, but that is not what Terraform tries to do.


stale bot commented Apr 20, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 20, 2021

cbui commented May 5, 2021

I think #1138 fixes this.

@stale stale bot removed the stale label May 5, 2021

mc-meta commented May 11, 2021

Still hitting this with v15.2.0, which AFAICT includes #1138. In my case, the previous state came from a run with a private copy of the module that included a prerelease of #1138, and the replacement is forced by:

 ~ instance_types  = [ # forces replacement
      + "r5.large",
    ]

The error is still:

Error: error creating EKS Node Group (xxxxxxxxxxxxxxx-1-apps-mng-1-on-cicada): ResourceInUseException: NodeGroup already exists with name xxxxxxxxxxxxxxxx-1-apps-mng-1-on-cicada and cluster name xxxxxxxxxxxxx

HTH

@olegcoreq

Yes, I'm facing this on 15.2 as well.

@Chinikins

Can confirm this is still a problem on 15.2


barryib commented May 19, 2021

@Chinikins @olegcoreq @mc-meta does #1372 resolve this issue?


mc-meta commented May 22, 2021

Hello @barryib,

I've replicated the cluster state from my previous comment. In my case, while v15.2.0 and v16.1.0 still fail with the same error, the code from #1372 seems to work fine and MNGs are recreated as expected without errors.

A couple of points to note:

  • Had to bump the AWS provider to version >= 3.40.0, since the previously required 3.37.0 does not provide the node_group_name_prefix argument in the aws_eks_node_group resource (see the constraint sketch after this comment):

    https://github.com/hashicorp/terraform-provider-aws/releases/tag/v3.40.0

  • Had to shorten node_group names and retest, since the newly generated nodeGroupName is longer than the one from the current code. In my cluster it exceeded the maximum length:

    Error: error creating EKS Node Group (xxxx-xxxx-xxxx-xxxx/xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxx-1-apps-mng-220210520210054766200000017): InvalidParameterException: nodeGroupName can't be longer than 63 characters!
    {
      RespMetadata: {
        StatusCode: 400,
        RequestID: "3e52fc51-22fb-48f8-b524-673e387390c4"
      },
      Message_: "nodeGroupName can't be longer than 63 characters!"
    }
    
    

HTH
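
For reference, a sketch of the provider constraint being described (just an example versions.tf, not the module's actual file):

terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # node_group_name_prefix on aws_eks_node_group requires provider
      # v3.40.0 or newer, per the release notes linked above.
      version = ">= 3.40.0"
    }
  }
}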

@Chinikins

Also confirming that #1372, together with bumping the AWS provider, fixes the issue for me.


barryib commented May 27, 2021

Thank you for your feedback. I've just bumped the required version.

@cabrinha

> Had to shorten node_group names and retest, since the newly generated nodeGroupName is longer than the one from the current code. In my cluster it exceeded the maximum length: nodeGroupName can't be longer than 63 characters!

How do we fix the max length issue? It would be nice to truncate the name somewhere...

@maxbrunet

I think we could drop the cluster_name bit from the node group name:

node_groups_names = { for k, v in local.node_groups_expanded : k => lookup(
  v,
  "name",
  lookup(
    v,
    "name_prefix",
    join("-", [var.cluster_name, k])
  )
) }

Node groups are already namespaced under the cluster:

arn:aws:eks:<region>:<accountId>:nodegroup/<clusterName>/<nodeGroupName>/<uuid>

As a workaround, name_prefix can be overridden in the node group configuration:

node_groups = {
  foobar = {
    name_prefix = "foobar"

    # ...
  }
}
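
On the length question above, a hypothetical sketch of truncating the generated name to the 63-character limit reported by the EKS API; the local names here are illustrative, not the module's actual code:

locals {
  # Hypothetical: cap each generated node group name at 63 characters,
  # the limit the EKS API enforces. When name_prefix is used, the
  # provider appends a random suffix, so the prefix itself would need
  # to stay well under this limit.
  node_groups_names_truncated = {
    for k, v in local.node_groups_names : k => substr(v, 0, 63)
  }
}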

@ShadySQL

Bumped into this issue using this provider:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "3.51.0"
    }
  }
}

I don't think updating the provider would resolve this issue.

@ikarlashov

@barryib
Hey Barry, why is it still closed? Please reopen the issue. I'm facing it with the latest 3.64.2 provider.

@daroga0002

> Hey Barry, why is it still closed? Please reopen the issue. I'm facing it with the latest 3.64.2 provider.

Which version of the module are you using?

@ikarlashov

@daroga0002 It seems to be 17.1.0.

@daroga0002

Please update to the latest version, as there have been multiple improvements in that area, and then share the error you are getting.


danparsons commented Nov 11, 2021

I, too, am having the exact same problem with module version 17.23.0, the latest release available, and the latest AWS provider, v3.64.2. My case is (perhaps) a little different - I'm adding key_name and source_security_group_ids to my node_groups - but it's the same situation: it forces Terraform to delete and recreate the node groups instead of updating them in place.


danparsons commented Nov 11, 2021

In the end I had to create new node groups, drain the old ones, delete them via eksctl, and remove them from my .tf files. The next terraform apply then deleted and recreated the new node groups, but finally accepted them as "terraform managed", and subsequent terraform plans were clean. No amount of terraform state importing, moving, or editing was able to get me around Terraform wanting to delete all my node groups and recreate them.

I finally let it go, even though there was a 5-minute outage while there were 0 worker nodes, because we could take the outage and it was faster than digging through this issue further.

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 16, 2022