
Cannot Create a New Cluster w/ 17.21.0 #1635

Closed
onelapahead opened this issue Oct 12, 2021 · 23 comments · Fixed by #1639

@onelapahead

Description

The module fails to create new clusters because, by default, it tries to look up an EKS cluster that does not exist yet.

⚠️ Note

Before you submit an issue, please perform the following first:

  1. Remove the local .terraform directory (ONLY if state is stored remotely, which is hopefully the best practice you are following): rm -rf .terraform/
  2. Re-initialize the project root to pull down modules: terraform init
  3. Re-attempt your terraform plan or apply and check if the issue still persists

Versions

  • Terraform: v1.0.7
  • Provider(s): hashicorp/aws v3.62.0
  • Module: v17.21.0

Reproduction

Steps to reproduce the behavior:

Attempt to create an EKS cluster w/ node groups. For reference I'm using region ap-southeast-2.

Code Snippet to Reproduce

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  cluster_name    = local.cluster_name
  cluster_version = "1.21"
  subnets         = module.vpc.private_subnets
  vpc_id          = module.vpc.vpc_id

  node_groups = {
    workers = {
      instance_types = [var.worker_instance_type]
      iam_role_arn = aws_iam_role.workers.arn
      desired_capacity = var.min_capacity
      max_capacity     = var.max_capacity
      min_capacity     = var.min_capacity

      launch_template_id      = aws_launch_template.default.id
      launch_template_version = aws_launch_template.default.default_version

      additional_tags = local.common_tags

      depends_on = [
        aws_iam_role_policy_attachment.workers-AmazonEKSWorkerNodePolicy,
        aws_iam_role_policy_attachment.workers-AmazonEKS_CNI_Policy,
        aws_iam_role_policy_attachment.workers-AmazonEC2ContainerRegistryReadOnly,
      ]
    }
  }
}

Expected behavior

I can use this module to create an EKS cluster w/ node groups

Actual behavior

It's failing to create a cluster w/ node groups because it cannot find a default EKS cluster.

Terminal Output Screenshot(s)

(Screenshot of terminal output: Screen Shot 2021-10-12 at 12 01 25 PM)

Additional context

see #1633 (comment)

@daroga0002
Contributor

I am investigating this

@daroga0002
Contributor

@bengesoff @hfuss I checked, and it seems the issue occurs not during cluster creation but rather during the refresh phase, when new cluster code is added to existing code.

Could you confirm?

@daroga0002
Contributor

@stevehipwell this is related to the data source eks_cluster which we introduced in your PR; it is used just to fetch

cluster_endpoint = data.aws_eks_cluster.default[0].endpoint
cluster_ca = data.aws_eks_cluster.default[0].certificate_authority[0].data

Now I am looking at bootstrap.sh from AWS, and it seems those parameters are optional and can be fetched by the script itself (if not set): https://github.com/awslabs/amazon-eks-ami/blob/dbba9499841d3936d285bd2427f90ef0cdd385b3/files/bootstrap.sh#L20-L21

What I am thinking now is to just remove the data source and those two parameters from the template.

Was there any special reason why this should be fetched by Terraform instead of by bootstrap.sh?
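
For context, this is roughly the shape of the lookup in question (a minimal sketch, not the module's exact code; the create_eks gate and names follow this thread). Because a data source is read during plan/refresh, it fails on a fresh state where the cluster it names has not been created yet:

data "aws_eks_cluster" "default" {
  # Read happens at plan/refresh time and fails on a fresh state,
  # because the cluster does not exist yet.
  count = var.create_eks ? 1 : 0
  name  = var.cluster_name
}

locals {
  cluster_endpoint = data.aws_eks_cluster.default[0].endpoint
  cluster_ca       = data.aws_eks_cluster.default[0].certificate_authority[0].data
}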

@daroga0002
Contributor

Created a draft PR to remove the code causing the issues, but this will require testing.

@onelapahead
Author

@daroga0002 Sounds like you're on it, but just to follow up - that's not what I am observing. I am starting with an empty Terraform state and attempting to create a new VPC and EKS cluster.

Before it even gets started with a terraform apply, it errors out because the new code tries to look up an EKS cluster that does not exist.

The patch I opened in #1636 allowed me to create a brand new cluster, so your suggestion of possibly removing the data lookup of the default cluster makes sense to me from what I understand.

@stevehipwell
Contributor

stevehipwell commented Oct 12, 2021

@hfuss #1636 only works because you're not using the functionality this enables and aren't setting var.create_eks = false; if either of those weren't the case you'd have errors.

@daroga0002 #1638 will only work for EKS optimised AMIs with bootstrap.sh and not for fully customised AMIs. What are the rules on internal module variable changes and the module SemVer? My original, and better tested, version passed local.cluster_endpoint & local.cluster_auth_base64 into the node_groups module, but I changed this to try not to change the variables in the sub module. I'll open a new PR to use this pattern. If we can't add a new mandatory sub module variable, we could use "" defaults, as this usage wouldn't have worked before anyway.
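
Roughly the pattern I mean (a sketch only; the variable names follow this comment, not the exact PR diff). The parent module passes the values it already computes into the node_groups submodule instead of re-reading them from a data source:

module "node_groups" {
  source = "./modules/node_groups"

  # Hypothetical inputs; the values are computed in the parent module,
  # so no extra lookup of the (possibly not yet existing) cluster is needed.
  cluster_endpoint    = local.cluster_endpoint
  cluster_auth_base64 = local.cluster_auth_base64

  # ... other node group inputs ...
}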

@stevehipwell
Contributor

@daroga0002 I've created #1639 which should fix the issue and be a valid implementation of the pattern added in #1580.

@daroga0002
Contributor

daroga0002 commented Oct 12, 2021

I thought about this again and, in general, we currently require the CA and endpoint only when we are using an EKS optimized AMI:

but as the EKS optimized AMI has the bootstrap.sh script, it can fetch that information itself.

On the other hand, when a user wants to use a non EKS optimized AMI, they can always pass this data through pre_userdata.

One more thing I see is probably a second bug in the template, or maybe I am missing the use case, but

it means we will pass our own AMI while always expecting bootstrap.sh (so this is EKS optimized), and below we have
%{ if length(ami_id) > 0 && ami_is_eks_optimized ~}

@stevehipwell
Contributor

@daroga0002 I'll try and answer your points below.

It wouldn't be easy, if even possible, to pass the dynamic endpoint and CA into the userdata for a non EKS optimised AMI, which is why I added them as variables into the userdata template to be consumed by the custom userdata embedded below.

You are correct that bootstrap.sh on an EKS optimised AMI can download these values itself, but I would rather they were passed in, as happens by default, to limit the moving parts and so that nothing that can be generated in advance is left until runtime and done on each node. This is what we do for the other worker userdata, so it is consistent (MNGs do this too behind the scenes).

As far as the template goes, the first block either modifies bootstrap.sh, if we're getting a merged userdata because we didn't pass in an AMI ID, or it sets the variables without needing to modify bootstrap.sh to source them from another file. Then we add the custom userdata. Finally we call bootstrap.sh if we have an EKS optimised AMI that isn't going to merge the userdata. If you have a non-EKS optimised AMI, your custom userdata needs to replicate bootstrap.sh or your node won't join the cluster.
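
Roughly, the layering looks like this (a simplified sketch using the directive style quoted above, not the module's literal template; the names and commands here are illustrative):

#!/bin/bash -e
%{ if length(ami_id) == 0 ~}
# No AMI ID passed: EKS merges this userdata with its own, so the real
# template only tweaks bootstrap.sh here (details omitted in this sketch).
%{ else ~}
# AMI ID passed: make the cluster details available to the custom userdata.
export CLUSTER_ENDPOINT="${cluster_endpoint}"
export CLUSTER_AUTH_BASE64="${cluster_auth_base64}"
%{ endif ~}

# Custom userdata supplied by the user goes here.
${pre_userdata}

%{ if length(ami_id) > 0 && ami_is_eks_optimized ~}
# Custom EKS optimised AMI: userdata is not merged, so call bootstrap.sh here.
/etc/eks/bootstrap.sh "${cluster_name}" \
  --apiserver-endpoint "${cluster_endpoint}" \
  --b64-cluster-ca "${cluster_auth_base64}"
%{ endif ~}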

@stevehipwell
Contributor

To clarify the basic principles of the changes in #1580. Firstly, given two MNGs whose only difference is that one sets an AMI ID, the resulting userdata would be functionally the same; so if the passed-in AMI ID is the current default, you get the same outcome as if you hadn't passed it in. Secondly, if it's not an EKS optimised AMI, the required variables should be available to the custom bootstrap script. With this pattern I could take the current AWS EKS AMI ID and pass it into the module while setting it as not EKS optimised, and then put a modified version of bootstrap.sh into the custom userdata to fix bugs such as awslabs/amazon-eks-ami#782 (at the cost of needing to manually update the AMI ID).
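
For example, something along these lines (a sketch only; the keys are the ones discussed in this thread, the AMI ID is a placeholder, and the userdata shown is illustrative rather than a full bootstrap.sh replacement):

node_groups = {
  pinned = {
    # Placeholder AMI ID: substitute the current EKS optimised AMI for your
    # region and Kubernetes version.
    ami_id               = "ami-0123456789abcdef0"
    ami_is_eks_optimized = false

    # Because the AMI is marked as not EKS optimised, the template will not
    # call bootstrap.sh for you; the custom userdata must do it (here it
    # would run a locally patched copy of the script).
    pre_userdata = <<-EOT
      /etc/eks/patched-bootstrap.sh my-cluster
    EOT
  }
}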

@daroga0002
Contributor

@antonbabenko it seems this is a bug which was not caught during testing, as the tests were done from scratch. This is an issue which happens during the refresh phase of Terraform. I am thinking now with @stevehipwell about how we can best address this problem.

Question: should we somehow mark the GitHub release as buggy in the changelog?

@stevehipwell
Contributor

stevehipwell commented Oct 13, 2021

@daroga0002 off the top of my head the following solutions to the buggy release are possible but not necessarily the only ones.

  • Delete the v17.21.0 tag
  • Replace the v17.21.0 tag after the code is fixed
  • Release the fix as v17.21.1 or v17.22.0 and update the release docs to show v17.21.0 is buggy

@antonbabenko
Member

@daroga0002 I am not sure what the problem is in detail because I was not following this discussion.

We need to release the bug fix as a minor release (17.22.0); there is no need to highlight/mark anything as buggy.

@cmsmith7

Hi, we are also now in a predicament whereby we are unable to deploy new clusters; typically, the update happened during some of our upgrades and testing. Do you have an idea of when we could expect this to be fixed? Many thanks.

@stevehipwell
Contributor

@cmsmith7 can you not just use the previous version?
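
For example, pinning the module back until the fix lands (standard Terraform version pinning; adjust the constraint and inputs to your configuration):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "17.20.0" # last release before the failing cluster lookup was added

  # ... existing inputs unchanged ...
}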

@rastakajakwanna

rastakajakwanna commented Oct 13, 2021

17.21.0 effectively blocks creating new clusters with managed node groups because create_eks is true by default (and that's correct, because one wants to create the resources unless set otherwise).
Since the count on the data source only says "if the above variable is true, then create the data resource", and there is no conditional to actually depend on cluster creation (which it cannot, because data sources are evaluated first), the data source fails to gather information from the not-yet-existing control plane and therefore blocks new cluster creation.

v17.20.0 of the eks module is not blocking new cluster creation and #1639 looks like a proper fix to me.

@amahpour

FYI, as a beginner-to-intermediate TF user I spent ~5 hours today trying to debug this issue. I, mistakenly, did not specify a version number when pulling this module (and verified that rolling back to 17.20.0 works perfectly). Additionally, this issue page hasn't been indexed yet on Google, so searching for the output error yields meagre results (some referring to githubmemory.com, which eventually got me here).

Just thought I'd add my two cents so we can make this PR a priority...

@antonbabenko
Member

@daroga0002 @stevehipwell Do you guys know when the fix will be ready?

@stevehipwell
Contributor

@antonbabenko I think @daroga0002 is just waiting on your review #1639 (review).

@antonbabenko
Member

@stevehipwell I have just merged it and released v17.22.0.

During the night I received even more emails from GitHub and didn't have a chance to read the one where Dawid mentioned me. Thank you very much for your contribution!

@stevehipwell
Contributor

@antonbabenko this one was on me so I'm glad I could help get it fixed. I'm still slightly unclear as to why using the data source failed in the way it did, but in hindsight the current solution was the one that I should have stuck with as it's significantly simpler to pass variables rather than re-fetch them.

@daroga0002
Contributor

> @antonbabenko this one was on me so I'm glad I could help get it fixed. I'm still slightly unclear as to why using the data source failed in the way it did, but in hindsight the current solution was the one that I should have stuck with as it's significantly simpler to pass variables rather than re-fetch them.

In my view this is some Terraform bug with dependencies during the refresh phase, as I also tried to set up that data source with an explicit dependency, defining something like:

depends_on = [var.ng_depends_on]

but it was still failing

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 17, 2022