
Update a launch template of a managed node group without recreating that managed node group #1109

Closed
1 task done
pre opened this issue Nov 19, 2020 · 7 comments · Fixed by #1372

Comments

@pre

pre commented Nov 19, 2020

I have issues with updating a launch template of a Managed Node Group

I'm submitting a...

  • bug report

What is the current behavior?

Practically any change to a Launch Template will force recreation of a Managed Node Group.

If this is a bug, how to reproduce? Please include a code sample if relevant.

What's the expected behavior?

  • Changing the Launch Template happens in-place (as it currently does); this works, and the Launch Template gets a new version number.

  • The existing Managed Node Group is updated to the new Launch Template version number:

      launch_template_id      = aws_launch_template.ceph_osd.id
      launch_template_version = aws_launch_template.ceph_osd.default_version

  • The existing Managed Node Group is not replaced with a new resource. Instead, the existing MNG is kept.

  • Something causes ami_type, disk_size and node_group_name to change and force a replacement. I have tried setting them explicitly in node_groups_defaults or in the node_group (see the sketch below), but it has no effect. I have also tried leaving them out entirely - no effect, Terraform still tries to recreate the MNG.
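A minimal sketch of what "setting them explicitly in node_groups_defaults" means (a sketch only; the values just mirror the attributes shown in the Terraform output below):

module "eks" {
  # ... other module arguments unchanged ...

  node_groups_defaults = {
    ami_type  = "AL2_x86_64" # same value as in the plan output below
    disk_size = 0            # same value as in the plan output below
  }
}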

Environment details

  • Affected module version: 13.2.0
  • OS: MacOS
  • Terraform version:
Terraform v0.13.5
+ provider registry.terraform.io/hashicorp/aws v3.15.0
+ provider registry.terraform.io/hashicorp/kubernetes v1.13.3
+ provider registry.terraform.io/hashicorp/local v1.4.0
+ provider registry.terraform.io/hashicorp/null v2.1.2
+ provider registry.terraform.io/hashicorp/random v3.0.0
+ provider registry.terraform.io/hashicorp/template v2.2.0

Any other relevant info

Terraform output

  # module.eks.module.node_groups.aws_eks_node_group.workers["example"] must be replaced
+/- resource "aws_eks_node_group" "workers" {
      ~ ami_type        = "AL2_x86_64" -> (known after apply) # forces replacement
      ~ arn             = "arn:aws:eks:eu-central-1:161325215566:nodegroup/aws-eu-central-1-2/aws-eu-central-1-2-example/62baf003-22ef-a6e1-a4bc-ee8e1eda4d27" -> (known after apply)
        cluster_name    = "aws-eu-central-1-2"
      ~ disk_size       = 0 -> (known after apply) # forces replacement
      ~ id              = "aws-eu-central-1-2:aws-eu-central-1-2-example" -> (known after apply)
        instance_types  = []
      ~ labels          = {} -> (known after apply)
      ~ node_group_name = "aws-eu-central-1-2-example" -> (known after apply) # forces replacement
        node_role_arn   = "arn:aws:iam::161325215566:role/aws-eu-central-1-22020111907234856210000000a"
      ~ release_version = "1.18.9-20201117" -> (known after apply)
      ~ resources       = [
          - {
              - autoscaling_groups              = [
                  - {
                      - name = "eks-62baf003-22ef-a6e1-a4bc-ee8e1eda4d27"
                    },
                ]
              - remote_access_security_group_id = ""
            },
        ] -> (known after apply)
      ~ status          = "ACTIVE" -> (known after apply)
        subnet_ids      = [
            "subnet-0a97ca5e61048b727",
        ]
      ~ tags            = {
          - "CephRole" = "mon"
          - "NodeRole" = "storage"
          - "az"       = "eu-central-1c"
        } -> (known after apply)
      ~ version         = "1.18" -> (known after apply)

      ~ launch_template {
            id      = "lt-0054877e118c06bfc"
          ~ name    = "aws-eu-central-1-2-example" -> (known after apply)
          ~ version = "2" -> (known after apply)
        }

        scaling_config {
            desired_size = 1
            max_size     = 1
            min_size     = 1
        }
    }

  # module.eks.module.node_groups.random_pet.node_groups["example"] must be replaced
+/- resource "random_pet" "node_groups" {
      ~ id        = "capable-serval" -> (known after apply)
      ~ keepers   = {
          - "ami_type"                  = "AL2_x86_64"
          - "disk_size"                 = "0"
          - "iam_role_arn"              = "arn:aws:iam::161325215566:role/aws-eu-central-1-22020111907234856210000000a"
          - "instance_type"             = "t3.large"
          - "key_name"                  = ""
          - "launch_template"           = "lt-0054877e118c06bfc"
          - "node_group_name"           = "aws-eu-central-1-2-example"
          - "source_security_group_ids" = ""
          - "subnet_ids"                = "subnet-0a97ca5e61048b727"
        } -> (known after apply) # forces replacement
        length    = 2
        separator = "-"
    }
Node Group
node_groups = {
  example = {
    name             = "${var.cluster_name}-example"
    subnets          = [module.vpc.private_subnets[2]]
    desired_capacity = 1
    max_capacity     = 1
    min_capacity     = 1

    launch_template_id      = aws_launch_template.example.id
    instance_type           = aws_launch_template.example.instance_type
    launch_template_version = aws_launch_template.example.default_version

    additional_tags = {
      NodeRole = "storage"
      az       = "eu-central-1c"
    }
  }
}
Launch Template
# https://github.com/terraform-aws-modules/terraform-aws-eks/blob/v13.2.0/examples/launch_templates_with_managed_node_groups/launchtemplate.tf

data "template_file" "launch_template_userdata_mon" {
  template = file("${path.module}/templates/userdata.sh.tpl")

  vars = {
    cluster_name        = var.cluster_name
    endpoint            = module.eks.cluster_endpoint
    cluster_auth_base64 = module.eks.cluster_certificate_authority_data
  }
}

resource "aws_launch_template" "example" {
  name                   = "${var.cluster_name}-example"
  description            = "Example Launch-Template"
  update_default_version = true

  user_data = base64encode(
    data.template_file.launch_template_userdata_mon.rendered
  )

  instance_type = "t3.large"

  block_device_mappings {
    device_name = "/dev/xvda"

    ebs {
      volume_size           = 25
      volume_type           = "gp2"
      delete_on_termination = true
    }
  }

  monitoring {
    enabled = true
  }

  network_interfaces {
    associate_public_ip_address = false
    delete_on_termination       = true
    security_groups             = [module.eks.worker_security_group_id]
  }

  tag_specifications {
    resource_type = "instance"

    tags = {
      NodeRole = "storage"
    }
  }

  tag_specifications {
    resource_type = "volume"

    tags = {
      NodeRole = "storage"
    }
  }

  tags = {
    NodeRole = "storage"
  }

  lifecycle {
    create_before_destroy = true
  }
}

@pre
Author

pre commented Nov 19, 2020

I tried removing instance_type from the node_group, since the instance type is already defined by the Launch Template:

  • No changes to the Managed Node Group (as expected)
  • But: module.eks.module.node_groups.random_pet.node_groups["example"] must be replaced

Maybe this is a clue? I don't understand what purpose module.eks.module.node_groups.random_pet.node_groups["example"] serves in the first place, as it contains attributes which belong to the Launch Template. Also, even though I have defined an explicit name, the random_pet is still created - is there a way to avoid the random_pet?

Terraform will perform the following actions:

  # module.eks.module.node_groups.random_pet.node_groups["example"] must be replaced
+/- resource "random_pet" "node_groups" {
      ~ id        = "amazed-seasnail" -> (known after apply)
      ~ keepers   = { # forces replacement
            "ami_type"                  = "AL2_x86_64"
            "disk_size"                 = "0"
            "iam_role_arn"              = "arn:aws:iam::161325215566:role/aws-eu-central-1-22020111907234856210000000a"
          ~ "instance_type"             = "t3.large" -> "m4.large"
            "key_name"                  = ""
            "launch_template"           = "lt-0054877e118c06bfc"
            "node_group_name"           = "aws-eu-central-1-2-example"
            "source_security_group_ids" = ""
            "subnet_ids"                = "subnet-0a97ca5e61048b727"
        }
        length    = 2
        separator = "-"
    }

Plan: 1 to add, 0 to change, 1 to destroy.

@pre pre changed the title How to update a launch template of a managed node group Update a launch template of a managed node group without recreating the managed node group Nov 19, 2020
@pre pre changed the title Update a launch template of a managed node group without recreating the managed node group Update a launch template of a managed node group without recreating that managed node group Nov 19, 2020
@pre
Author

pre commented Jan 20, 2021

I finally took a deeper look into this issue.

Learnings:

  • There is only a single link between a Managed Node Group and its random_pet, and it is node_group_name

  • You may define an explicit node_group name and it will be used.

  • When you use an explicit node_group name, the random_pet resource for that Node Group is still created (module.eks.module.node_groups.random_pet.node_groups["test-workers"]), but it is not used for anything

  • When the random_pet resource is created, changing any attribute in a Launch Template used by a Node Group will result in recreating that Node Group.

=> RECREATING the Node Group is incorrect behaviour.
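For context, the pattern in modules/node_groups/random.tf looks roughly like this. This is a sketch reconstructed from the keepers visible in the plan output above, not the module's exact code; in particular the for_each expression is illustrative:

resource "random_pet" "node_groups" {
  for_each = var.node_groups # illustrative; the module iterates its node group map

  length    = 2
  separator = "-"

  # Any change to one of these keepers forces a new pet name, and because the
  # pet name feeds into node_group_name, the Managed Node Group is replaced too.
  keepers = {
    ami_type        = lookup(each.value, "ami_type", "")
    disk_size       = lookup(each.value, "disk_size", "")
    instance_type   = lookup(each.value, "instance_type", "")
    launch_template = lookup(each.value, "launch_template_id", "")
    # ... plus iam_role_arn, key_name, node_group_name,
    #     source_security_group_ids and subnet_ids, as in the plan output
  }
}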

Changing the Launch Template will create a new version of that Launch Template. However, updating the Managed Node Group to use this new version of the Launch Template is a separate action.

A Managed Node Group can use either of the following (see the sketch below):

  • launch_template_version = 1 # an explicit version number
  • launch_template_version = aws_launch_template.worker_lolcat.default_version # the Node Group always uses the latest version

Launch Template versions are meant to be used as a managed way to distribute updates to Managed Node Groups. If the Managed Node Group is recreated from scratch, rather than updated in-place, it defeats the purpose.
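In the module's node_groups syntax, a minimal sketch of the two options (the worker_lolcat template name is just the placeholder from the list above):

node_groups = {
  test-workers = {
    # ...
    launch_template_id = aws_launch_template.worker_lolcat.id

    # Either pin an explicit version:
    # launch_template_version = 1
    # ...or always follow the template's default (latest) version:
    launch_template_version = aws_launch_template.worker_lolcat.default_version
  }
}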

Workaround - The random_pet resource is harmful

The issue can be worked around as follows:

  • Define an explicit name for your node_group (so random_pet is not used for anything in the node_group):
  node_groups = {
    test-workers = {
      name                    = "${var.cluster_name}-test-workers"
      # ..
    }
  }

TL;DR: With an explicit node_group name, modules/node_groups/random.tf is not used for anything, but its mere existence causes this whole issue.

Avoid random_pet - it is a nice helper for getting started easily, but it will only cause debugging pain in a live environment.

@pinkavaj

pinkavaj commented Feb 28, 2021

This is particularly an issue when Terraform is applied from a Linux machine and then from a Windows machine (or vice versa), because the rendered userdata.sh.tpl differs due to the different newlines ...
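A possible mitigation is to normalize line endings before encoding the userdata. This is only a sketch against the launch template shown earlier in this issue, not something the module does for you:

resource "aws_launch_template" "example" {
  # ...

  user_data = base64encode(
    # Strip carriage returns so the rendered userdata is byte-identical
    # whether userdata.sh.tpl was checked out with LF or CRLF line endings.
    replace(data.template_file.launch_template_userdata_mon.rendered, "\r", "")
  )
}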

@ngocketit

This is indeed a pain for production use. @pre How did you manage to delete random.tf?

@pre
Author

pre commented Apr 5, 2021

This is indeed a pain for production use. @pre How did you manage to delete random.tf?

I forked this module and stopped dreaming about community-maintained Terraform.

@drunkirishcoder

We just ran into this problem as well. Is there a plan to fix this in the module?

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 21, 2022