EKS Node Group fails to recreate when using launch template, on minor template update #1152

Closed
jcam opened this issue Dec 19, 2020 · 24 comments · Fixed by #1372

Comments


jcam commented Dec 19, 2020

I have issues (while true, that is separate from this bug report ;) )

I'm submitting a...

  • [ X ] bug report

What is the current behavior?

After creating an EKS node group with a launch template, updating the template causes a "NodeGroup already exists" failure.

If this is a bug, how to reproduce? Please include a code sample if relevant.

Create an EKS node group with a custom launch template, using custom userdata

[...]
  node_groups = {
    worker = {
      desired_capacity        = 2
      max_capacity            = 6
      min_capacity            = 1
      instance_type           = "r5.xlarge"
      additional_tags         = local.common_tags
      launch_template_id      = aws_launch_template.default.id
      launch_template_version = aws_launch_template.default.default_version
    }
  }
[...]
resource "aws_launch_template" "default" {
  name_prefix            = "${var.cluster_name}"
  description            = "Default Launch-Template for clusters"
  update_default_version = true
  key_name               = data.aws_ssm_parameter.keypair.value
  user_data              = base64encode(replace(data.cloudinit_config.node_group.rendered, "MIMEBOUNDARY", "//"))
  ebs_optimized          = true
  instance_type          = "r5.xlarge"
[...]
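
For completeness, a minimal sketch of what the referenced cloudinit_config data source might look like; the script path and part content here are assumptions, not taken from the actual configuration:

data "cloudinit_config" "node_group" {
  gzip          = false
  base64_encode = false

  # Hypothetical userdata part; the real configuration renders custom
  # userdata, which is then base64-encoded into the launch template above.
  part {
    content_type = "text/x-shellscript"
    content      = file("${path.module}/userdata.sh")
  }
}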

Update the user data being consumed by the launch template.
On apply, Terraform will instruct AWS to update the launch template in place, creating a new version of the template.

Terraform will also instruct AWS to create a new node group using the updated launch template version, but it will fail with this error:

Error: error creating EKS Node Group (uccdev-west:uccdev-west-worker-magnetic-puma): ResourceInUseException: NodeGroup already exists with name uccdev-west-worker-magnetic-puma and cluster name uccdev-west

This is caused by the configuration of the random_pet for the node_group, which currently does not have the launch template version as one of its keepers.

What's the expected behavior?

The random_pet name should be updated, a new node group should be created, and EKS should migrate all workloads to the new node group before shutting down the old one.

Are you able to fix this problem and submit a PR? Link here if you have already.

In this file: https://github.com/terraform-aws-modules/terraform-aws-eks/blob/master/modules/node_groups/random.tf#L20
Add the following as line 21:

launch_template_version = lookup(each.value, "launch_template_version", null)
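
For context, a minimal sketch of roughly where that keeper would sit in the random_pet resource; the surrounding attributes are assumptions, not a copy of the module source:

resource "random_pet" "node_groups" {
  for_each = local.node_groups_expanded

  keepers = {
    # ...existing keepers from the module elided...

    # Proposed addition: generate a new pet name (and therefore a new
    # node group name) whenever the launch template version changes.
    launch_template_version = lookup(each.value, "launch_template_version", null)
  }
}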

Environment details

  • Affected module version: 13.2.1
  • OS: MacOS 10.15.7
  • Terraform version: v0.14.0

Any other relevant info


jcam commented Dec 20, 2020

After updating the userdata again, I looked back and saw that the actual trigger of the replacement was a change in the instance type:

  # module.eks.module.node_groups.aws_eks_node_group.workers["worker"] must be replaced
+/- resource "aws_eks_node_group" "workers" {
      ~ ami_type        = "AL2_x86_64" -> (known after apply)
      ~ arn             = "arn:aws:eks:us-west-2:363865594249:nodegroup/uccdev-west/uccdev-west-worker-magnetic-puma/08bb3e64-9672-c685-3444-0e776cd0a017" -> (known after apply)
      ~ capacity_type   = "ON_DEMAND" -> (known after apply)
      ~ disk_size       = 0 -> (known after apply)
      ~ id              = "uccdev-west:uccdev-west-worker-magnetic-puma" -> (known after apply)
      ~ instance_types  = [ # forces replacement
          - "t3.medium",
        ]
      - labels          = {} -> null
      ~ release_version = "1.18.9-20201211" -> (known after apply)
      ~ resources       = [
          - {
              - autoscaling_groups              = [
                  - {
                      - name = "eks-08bb3e64-9672-c685-3444-0e776cd0a017"
                    },
                ]
              - remote_access_security_group_id = ""
            },
        ] -> (known after apply)
      ~ status          = "ACTIVE" -> (known after apply)
        tags            = {
            "Application"      = "EKS"
            "Environment"      = "UCCDEV"
            "GitRepoName"      = "sre-eks-terraform"
            "GitRepoPath"      = "terraform/eks-cluster"
            "ManagementMethod" = "tfstack-v0.44.1"
            "Name"             = "eks-cluster"
            "Owner"            = "jcampbell@mms.org"
            "Product"          = "SRE-Infrastructure"
            "Purpose"          = "Kubernetes Container Hosting"
            "Region"           = "us-west-2"
        }
      ~ version         = "1.18" -> (known after apply)
        # (4 unchanged attributes hidden)

      ~ launch_template {
            id      = "lt-024139de7893b8632"
          ~ name    = "uccdev-west-20201219201305265300000001" -> (known after apply)
          ~ version = "5" -> "6"
        }

        # (1 unchanged block hidden)
    }

Plan: 1 to add, 0 to change, 1 to destroy.

@surajnarwade

Hi @jcam, I am facing the same issue with managed node groups. I tried adding:

launch_template_version = lookup(each.value, "launch_template_version", null)

as you mentioned, but it's not working; it's still recreating the node group for me.


pre commented Jan 20, 2021

I'm failing to understand the correct behaviour expected by the original description in this issue (above).

In my mind, the Managed Node Group must not be recreated when the Launch Template is updated.

The Managed Node Group is using a certain version of a Launch Template. When that Launch Template is updated, a new version of the same Launch Template will be available (same Launch Template id).

As a result, the existing Managed Node Group can be updated to use the new version of the same Launch Template.
Updating the version number bound to a certain Managed Node Group may happen at the time when the Launch Template is updated or later when the bound version is changed (as a separate action).

The crucial point is: The Managed Node Group should NOT be recreated when a Launch Template is updated.

Instead, the new version of the same Launch Template will be made available to the Managed Node Group, but it is a separate decision to update the Managed Node Group to use this new Launch Template version.
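
To illustrate that in-place path, here is a rough sketch with the raw aws_eks_node_group resource (placeholder names, not this module's interface); changing launch_template.version is an update of the existing node group, not a replacement:

resource "aws_eks_node_group" "example" {
  cluster_name    = "my-cluster"
  node_group_name = "example"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.subnet_ids

  scaling_config {
    desired_size = 2
    max_size     = 6
    min_size     = 1
  }

  launch_template {
    id = aws_launch_template.default.id

    # Pin to a specific version and bump it deliberately, or reference
    # aws_launch_template.default.latest_version to roll nodes whenever
    # the template changes; either way the node group itself is updated
    # in place rather than destroyed and recreated.
    version = aws_launch_template.default.default_version
  }
}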


pre commented Jan 20, 2021

Btw, check out the following #1109 (comment)

You might be experiencing the same issue, but having different symptoms.

TL;DR avoid random_pet


jcam commented Jan 20, 2021

I agree the existing node group should be updated, but that is not what Terraform tries to do.


stale bot commented Apr 20, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 20, 2021

cbui commented May 5, 2021

I think #1138 fixes this.

@stale stale bot removed the stale label May 5, 2021

mc-meta commented May 11, 2021

Still hitting this with v15.2.0, which AFAICT includes #1138. In my case, the previous state came from a run with a private copy of the module that included a prerelease of #1138, and the replacement is forced by:

 ~ instance_types  = [ # forces replacement
      + "r5.large",
    ]

The error is still:

Error: error creating EKS Node Group (xxxxxxxxxxxxxxx-1-apps-mng-1-on-cicada): ResourceInUseException: NodeGroup already exists with name xxxxxxxxxxxxxxxx-1-apps-mng-1-on-cicada and cluster name xxxxxxxxxxxxx

HTH

@olegcoreq

Yes, I'm facing this on 15.2 as well.

@Chinikins

Can confirm this is still a problem on 15.2


barryib commented May 19, 2021

@Chinikins @olegcoreq @mc-meta does #1372 resolve this issue?


mc-meta commented May 22, 2021

Hello @barryib,

I've replicated the cluster state from my previous comment. In my case, while v15.2.0 and v16.1.0 still fail with the same error, the code from #1372 seems to work fine and MNGs are recreated as expected without errors.

A couple of points to note:

  • Had to bump the AWS provider to version >= 3.40.0, since the previously required 3.37.0 does not provide the node_group_name_prefix argument in the aws_eks_node_group resource (see the constraint sketch after this comment):

    https://github.com/hashicorp/terraform-provider-aws/releases/tag/v3.40.0

  • Had to shorten node_group names and retest, since the newly generated nodeGroupName is longer than the one from the current code. In my cluster it exceeded the maximum length:

    Error: error creating EKS Node Group (xxxx-xxxx-xxxx-xxxx/xxxx-xxxx-xxxx-xxxx-xxxx-xxxx-xxx-1-apps-mng-220210520210054766200000017): InvalidParameterException: nodeGroupName can't be longer than 63 characters!
    {
      RespMetadata: {
        StatusCode: 400,
        RequestID: "3e52fc51-22fb-48f8-b524-673e387390c4"
      },
      Message_: "nodeGroupName can't be longer than 63 characters!"
    }
    
    

HTH
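
For reference, a sketch of the provider constraint being described (just an example versions.tf, not the module's actual file):

terraform {
  required_providers {
    aws = {
      source = "hashicorp/aws"
      # node_group_name_prefix on aws_eks_node_group requires provider
      # v3.40.0 or newer, per the release notes linked above.
      version = ">= 3.40.0"
    }
  }
}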

@Chinikins

Also confirming that #1372, together with bumping the AWS provider, fixes the issue for me.


barryib commented May 27, 2021

Thank you for your feedback. I've just bumped the required version.

@cabrinha

> Had to shorten node_group names and retest, since the newly generated nodeGroupName is longer than the one from the current code. In my cluster it exceeded the maximum length: nodeGroupName can't be longer than 63 characters!

How do we fix the max length issue? It would be nice to truncate the name somewhere...

@maxbrunet

I think we could drop the cluster_name bit from the node group name:

node_groups_names = { for k, v in local.node_groups_expanded : k => lookup(
  v,
  "name",
  lookup(
    v,
    "name_prefix",
    join("-", [var.cluster_name, k])
  )
) }

Node groups are already namespaced under the cluster:

arn:aws:eks:<region>:<accountId>:nodegroup/<clusterName>/<nodeGroupName>/<uuid>

As a workaround, name_prefix can be overridden in the node group configuration:

node_groups = {
  foobar = {
    name_prefix = "foobar"

    # ...
  }
}
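
On the length question above, a hypothetical sketch of truncating the generated name to the 63-character limit reported by the EKS API; the local names here are illustrative, not the module's actual code:

locals {
  # Hypothetical: cap each generated node group name at 63 characters,
  # the limit the EKS API enforces. When name_prefix is used, the
  # provider appends a random suffix, so the prefix itself would need
  # to stay well under this limit.
  node_groups_names_truncated = {
    for k, v in local.node_groups_names : k => substr(v, 0, 63)
  }
}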

@ShadySQL

Bumped into this issue using this provider:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "3.51.0"
    }
  }
}

I don't think updating the provider would resolve this issue.

@ikarlashov

@barryib
Hey Barry, why is it still closed? Please reopen the issue. I'm facing it with the latest 3.64.2 provider.

@daroga0002

> Hey Barry, why is it still closed? Please reopen the issue. I'm facing it with the latest 3.64.2 provider.

Which version of the module are you using?

@ikarlashov

@daroga0002 It seems to be 17.1.0.

@daroga0002

Please update to the latest version, as there have been multiple improvements in that area, and then share the error you are getting.


danparsons commented Nov 11, 2021

I, too, am having the exact same problem with module version 17.23.0, the latest release available, and the latest AWS provider, v3.64.2. My case is (perhaps) a little different - I'm adding key_name and source_security_group_ids to my node_groups - but it's the same situation: it forces Terraform to delete and recreate the node groups instead of updating them in place.


danparsons commented Nov 11, 2021

In the end I had to create new node groups, drain the old ones, delete them via eksctl, and remove them from my .tf files. The next terraform apply then deleted and recreated the new node groups, but finally accepted them as "terraform managed", and subsequent terraform plans were clean. No amount of terraform state importing, moving, or editing was able to get me around Terraform wanting to delete all my node groups and recreate them.

I finally let it go, even though there was a 5-minute outage while there were 0 worker nodes, because we could take the outage and it was faster than digging through this issue further.

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 16, 2022