
Cannot Create a New Cluster w/ 17.21.0 #1635

Closed
onelapahead opened this issue Oct 12, 2021 · 23 comments · Fixed by #1639

@onelapahead

Description

The module fails to create new clusters because, by default, it tries to look up an EKS cluster that does not exist yet.

⚠️ Note

Before you submit an issue, please perform the following first:

  1. Remove the local .terraform directory (ONLY if state is stored remotely, which is hopefully the best practice you are following): rm -rf .terraform/
  2. Re-initialize the project root to pull down modules: terraform init
  3. Re-attempt your terraform plan or apply and check if the issue still persists

Versions

  • Terraform: v1.0.7
  • Provider(s): hashicorp/aws v3.62.0
  • Module: v17.21.0

Reproduction

Steps to reproduce the behavior:

Attempt to create an EKS cluster w/ node groups. For reference I'm using region ap-southeast-2.

Code Snippet to Reproduce

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  cluster_name    = local.cluster_name
  cluster_version = "1.21"
  subnets         = module.vpc.private_subnets
  vpc_id          = module.vpc.vpc_id

  node_groups = {
    workers = {
      instance_types = [var.worker_instance_type]
      iam_role_arn = aws_iam_role.workers.arn
      desired_capacity = var.min_capacity
      max_capacity     = var.max_capacity
      min_capacity     = var.min_capacity

      launch_template_id      = aws_launch_template.default.id
      launch_template_version = aws_launch_template.default.default_version

      additional_tags = local.common_tags

      depends_on = [
        aws_iam_role_policy_attachment.workers-AmazonEKSWorkerNodePolicy,
        aws_iam_role_policy_attachment.workers-AmazonEKS_CNI_Policy,
        aws_iam_role_policy_attachment.workers-AmazonEC2ContainerRegistryReadOnly,
      ]
    }
  }
}

Expected behavior

I can use this module to create an EKS cluster w/ node groups

Actual behavior

It's failing to create a cluster w/ node groups because it cannot find a default EKS cluster.

Terminal Output Screenshot(s)

(Screenshot of terminal output: Screen Shot 2021-10-12 at 12 01 25 PM)

Additional context

see #1633 (comment)

@daroga0002
Contributor

I am investigating this

@daroga0002
Contributor

@bengesoff @hfuss I checked, and it seems the issue occurs not during cluster creation but rather during the refresh phase, when new cluster code is added to existing code.

Could you confirm?

@daroga0002
Contributor

@stevehipwell this is related to the data source eks_cluster which we introduced in your PR; it is used just to fetch

cluster_endpoint = data.aws_eks_cluster.default[0].endpoint
cluster_ca = data.aws_eks_cluster.default[0].certificate_authority[0].data

Now I am looking at bootstrap.sh from AWS, and it seems those parameters are optional and can be fetched by the script itself (if not set): https://github.com/awslabs/amazon-eks-ami/blob/dbba9499841d3936d285bd2427f90ef0cdd385b3/files/bootstrap.sh#L20-L21

What I am thinking now is to just remove the data source and those two parameters from the template.

Was there any special reason why this should be fetched by Terraform instead of by bootstrap.sh?
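
For context, this is roughly the shape of the lookup in question (a minimal sketch, not the module's exact code; the create_eks gate and names follow this thread). Because a data source is read during plan/refresh, it fails on a fresh state where the cluster it names has not been created yet:

data "aws_eks_cluster" "default" {
  # Read happens at plan/refresh time and fails on a fresh state,
  # because the cluster does not exist yet.
  count = var.create_eks ? 1 : 0
  name  = var.cluster_name
}

locals {
  cluster_endpoint = data.aws_eks_cluster.default[0].endpoint
  cluster_ca       = data.aws_eks_cluster.default[0].certificate_authority[0].data
}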

@daroga0002
Contributor

Created a draft PR to remove the code causing the issues, but this will require testing.

@onelapahead
Author

@daroga0002 Sounds like you're on it, but just to follow up - that's not what I am observing. I am starting with an empty Terraform state and attempting to create a new VPC and EKS cluster.

Before it even gets started with a terraform apply, it errors out because the new code tries to look up an EKS cluster that does not exist.

The patch I opened in #1636 allowed me to create a brand new cluster, so your suggestion of possibly removing the data lookup of the default cluster makes sense to me from what I understand.

@stevehipwell
Contributor

stevehipwell commented Oct 12, 2021

@hfuss #1636 only works because you're not using the functionality this enables and aren't setting var.create_eks = false; if either of those weren't the case you'd have errors.

@daroga0002 #1638 will only work for EKS optimised AMIs with bootstrap.sh and not for fully customised AMIs. What are the rules on internal module variable changes and the module SemVer? My original, and better tested, version passed local.cluster_endpoint & local.cluster_auth_base64 into the node_groups module, but I changed this to try not to change the variables in the sub module. I'll open a new PR to use this pattern. If we can't add a new mandatory sub module variable, we could use "" defaults, as this usage wouldn't have worked before anyway.
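
Roughly the pattern I mean (a sketch only; the variable names follow this comment, not the exact PR diff). The parent module passes the values it already computes into the node_groups submodule instead of re-reading them from a data source:

module "node_groups" {
  source = "./modules/node_groups"

  # Hypothetical inputs; the values are computed in the parent module,
  # so no extra lookup of the (possibly not yet existing) cluster is needed.
  cluster_endpoint    = local.cluster_endpoint
  cluster_auth_base64 = local.cluster_auth_base64

  # ... other node group inputs ...
}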

@stevehipwell
Contributor

@daroga0002 I've created #1639 which should fix the issue and be a valid implementation of the pattern added in #1580.

@daroga0002
Contributor

daroga0002 commented Oct 12, 2021

I thought about this again and, in general, we currently require the CA and endpoint only when we are using an EKS optimized AMI:

but as the EKS optimized AMI has the bootstrap.sh script, it can fetch that information itself.

On the other hand, when a user wants to use a non EKS optimized AMI, they can always pass this data through pre_userdata.

One more thing I see is probably a second bug in the template, or maybe I am missing the use case, but

it means we will pass our own AMI while always expecting bootstrap.sh (so this is EKS optimized), and below we have
%{ if length(ami_id) > 0 && ami_is_eks_optimized ~}

@stevehipwell
Contributor

@daroga0002 I'll try and answer your points below.

It wouldn't be easy, if even possible, to pass the dynamic endpoint and CA into the userdata for a non EKS optimised AMI, which is why I added them as variables into the userdata template to be consumed by the custom userdata embedded below.

You are correct that bootstrap.sh on an EKS optimised AMI can download these values itself, but I would rather they were passed in, as happens by default, to limit the moving parts and so that nothing that can be generated in advance is left until runtime and done on each node. This is what we do for the other worker userdata, so it is consistent (MNGs do this too behind the scenes).

As far as the template goes, the first block either modifies bootstrap.sh, if we're getting a merged userdata because we didn't pass in an AMI ID, or it sets the variables without needing to modify bootstrap.sh to source them from another file. Then we add the custom userdata. Finally we call bootstrap.sh if we have an EKS optimised AMI that isn't going to merge the userdata. If you have a non-EKS optimised AMI, your custom userdata needs to replicate bootstrap.sh or your node won't join the cluster.
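
Roughly, the layering looks like this (a simplified sketch using the directive style quoted above, not the module's literal template; the names and commands here are illustrative):

#!/bin/bash -e
%{ if length(ami_id) == 0 ~}
# No AMI ID passed: EKS merges this userdata with its own, so the real
# template only tweaks bootstrap.sh here (details omitted in this sketch).
%{ else ~}
# AMI ID passed: make the cluster details available to the custom userdata.
export CLUSTER_ENDPOINT="${cluster_endpoint}"
export CLUSTER_AUTH_BASE64="${cluster_auth_base64}"
%{ endif ~}

# Custom userdata supplied by the user goes here.
${pre_userdata}

%{ if length(ami_id) > 0 && ami_is_eks_optimized ~}
# Custom EKS optimised AMI: userdata is not merged, so call bootstrap.sh here.
/etc/eks/bootstrap.sh "${cluster_name}" \
  --apiserver-endpoint "${cluster_endpoint}" \
  --b64-cluster-ca "${cluster_auth_base64}"
%{ endif ~}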

@stevehipwell
Contributor

To clarify the basic principles of the changes in #1580. Firstly, given two MNGs whose only difference is that one sets an AMI ID, the resulting userdata would be functionally the same; so if the passed-in AMI ID is the current default, you get the same outcome as if you hadn't passed it in. Secondly, if it's not an EKS optimised AMI, the required variables should be available to the custom bootstrap script. With this pattern I could take the current AWS EKS AMI ID and pass it into the module while setting it as not EKS optimised, and then put a modified version of bootstrap.sh into the custom userdata to fix bugs such as awslabs/amazon-eks-ami#782 (at the cost of needing to manually update the AMI ID).
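
For example, something along these lines (a sketch only; the keys are the ones discussed in this thread, the AMI ID is a placeholder, and the userdata shown is illustrative rather than a full bootstrap.sh replacement):

node_groups = {
  pinned = {
    # Placeholder AMI ID: substitute the current EKS optimised AMI for your
    # region and Kubernetes version.
    ami_id               = "ami-0123456789abcdef0"
    ami_is_eks_optimized = false

    # Because the AMI is marked as not EKS optimised, the template will not
    # call bootstrap.sh for you; the custom userdata must do it (here it
    # would run a locally patched copy of the script).
    pre_userdata = <<-EOT
      /etc/eks/patched-bootstrap.sh my-cluster
    EOT
  }
}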

@daroga0002
Contributor

@antonbabenko it seems this is a bug which was not caught during testing, as the tests were done from scratch. This is an issue which happens during the refresh phase of Terraform. I am thinking now with @stevehipwell about how we can best address this problem.

Question: should we somehow mark the GitHub release as buggy in the changelog?

@stevehipwell
Contributor

stevehipwell commented Oct 13, 2021

@daroga0002 off the top of my head the following solutions to the buggy release are possible but not necessarily the only ones.

  • Delete the v17.21.0 tag
  • Replace the v17.21.0 tag after the code is fixed
  • Release the fix as v17.21.1 or v17.22.0 and update the release docs to show v17.21.0 is buggy

@antonbabenko
Member

@daroga0002 I am not sure what the problem is in detail because I was not following this discussion.

We need to release the bug fix as a minor release (17.22.0); there is no need to highlight/mark anything as buggy.

@cmsmith7

Hi, we are also now in a predicament whereby we are unable to deploy new clusters; typically, the update happened during some of our upgrades and testing. Do you have an idea of when we could expect this to be fixed? Many thanks.

@stevehipwell
Contributor

@cmsmith7 can you not just use the previous version?
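
For example, pinning the module back until the fix lands (standard Terraform version pinning; adjust the constraint and inputs to your configuration):

module "eks" {
  source  = "terraform-aws-modules/eks/aws"
  version = "17.20.0" # last release before the failing cluster lookup was added

  # ... existing inputs unchanged ...
}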

@rastakajakwanna

rastakajakwanna commented Oct 13, 2021

17.21.0 effectively blocks creating new clusters with managed node groups because create_eks is true by default (and that's correct, because one wants to create the resources unless set otherwise).
Since the count on the data source only says "if the above variable is true, then create the data resource", and there is no conditional to actually depend on cluster creation (which it cannot, because data sources are evaluated first), the data source fails to gather information from the not-yet-existing control plane and therefore blocks new cluster creation.

v17.20.0 of the eks module is not blocking new cluster creation and #1639 looks like a proper fix to me.

@amahpour

FYI, as a beginner-to-intermediate TF user I spent ~5 hours today trying to debug this issue. I, mistakenly, did not specify a version number when pulling this module (and verified that rolling back to 17.20.0 works perfectly). Additionally, this issue page hasn't been indexed yet on Google, so searching for the output error yields meagre results (some referring to githubmemory.com, which eventually got me here).

Just thought I'd add my two cents so we can make this PR a priority...

@antonbabenko
Member

@daroga0002 @stevehipwell Do you guys know when the fix will be ready?

@stevehipwell
Contributor

@antonbabenko I think @daroga0002 is just waiting on your review #1639 (review).

@antonbabenko
Member

@stevehipwell I have just merged it and released v17.22.0.

During the night I received even more emails from GitHub and didn't have a chance to read the one where Dawid mentioned me. Thank you very much for your contribution!

@stevehipwell
Contributor

@antonbabenko this one was on me so I'm glad I could help get it fixed. I'm still slightly unclear as to why using the data source failed in the way it did, but in hindsight the current solution was the one that I should have stuck with as it's significantly simpler to pass variables rather than re-fetch them.

@daroga0002
Contributor

> @antonbabenko this one was on me so I'm glad I could help get it fixed. I'm still slightly unclear as to why using the data source failed in the way it did, but in hindsight the current solution was the one that I should have stuck with as it's significantly simpler to pass variables rather than re-fetch them.

In my view this is some Terraform bug with dependencies during the refresh phase, as I also tried to set up that data source with an explicit dependency, defining something like:

depends_on = [var.ng_depends_on]

but it was still failing

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 17, 2022